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standing how biological systems process information. We take a physicist's approach of looking for 
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out necessarily modeling all (mostly unknown) microscopic details; the example that is developed 
throughout the notes is transcriptional regulation in genetic regulatory networks. We present 
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biological networks, and build generative and predictive models of regulatory processes. 
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Introduction 

In cell biology, neuroscience, as well as the study of 
collective behavior of organisms, networks of interacting 
agents or elements orchestrate the cellular, organismal or 
group response to the changes in the environment or the 
internal conditions of the system. For example, in cells, 
signaling proteins on the membrane can detect external 
chemicals and respond by chemically modifying other in- 
tracellular proteins, leading to a cascade of activity that 
can end with up- or down-regulation of the appropri- 
ate genes. Similarly, a genetic regulatory network com- 
prises a set of regulatory proteins, called transcription 
factors, that find and bind special non-coding regions on 
the DNA, thus causing a change in the expression levels 
of the regulated genes. In the nervous system, signals are 
propagated as "digital" action potentials: each neuron in 
the human brain receives synaptic input from other neu- 
rons and integrates it into its own decision to spike or not. 
Finally, flocks of birds, aggregating collections of single- 
celled amoeba (Dictyostelium) , schools of fish and even 
groups of people can exhibit collective behaviors that are 
not necessarily trivially understood from the properties 
of single group members. 
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As physicists, we often see collective behaviors as 
emerging from interactions among basic "simple" ele- 
ments. Yet in biology, even the building blocks of infor- 
mation processing networks are not simple. Proteins can 
have many conformational states that are hard to predict 
from the amino-acid sequence. The integration of infor- 
mation at the enhancer site in metazoan gene regulation 
- that is, how levels of regulatory proteins together deter- 
mine the expression level of the regulated gene - is still 
mostly unknown. And in the nervous system, examples 
in which single units change their information-processing 
properties through adaptation to sensory statistics or 
where the network dynamically modifies its inter-neural 
connections as a consequence of learning, have been stud- 
ied extensively, but are still poorly understood. 

Despite this complexity in network building blocks and 
the fact that processes in biological networks can occur 
on many (not necessarily well-separated) timescales, we 
can still hope to find phenomenological or coarse-grained 
descriptions. There is no underlying theory of biologi- 
cal networks in the sense known to hard physical science. 
But our view is that while they operate and are essen- 
tially constrained by basic physics (some of the examples 
we will see later), the biological networks are also subject 
to evolutionary pressures for function. This is a crucial 
distinction in comparison with inanimate systems, and 
there is hope that the intuition of a "network X being 
driven to perform function Y well" might generate a pre- 
dictive theory for biological network X. For example, in 
the metabolic pathway of Escherichia coli, the function 
that "given the level of nutrients, the bacteria should 
maximize the growth rate" can be mathematically for- 
mulated and actually leads to verifiable predictions about 



the network architecture ( Ibarra et al 2002 1 



In this Introduction, we motivate the lectures by asking 
three questions (iTkacikl |2007| : 



1. When considering biological networks that process 
information, how might one quantify the network 
function in a mathematically concise way? Is it 
possible to derive network properties by optimizing 
for such function, as is the case with metabolic net- 
works? Are there general principles that underlie 
information processing in living systems? 

2. What kinds of measurements can we perform on bi- 
ological information processing networks and, hav- 
ing these measurements, how can they be analyzed? 

3. How is the analysis of such networks different 
from the typical analyses of collective behaviors in 
physics? Which concepts and tools from physics 
can be borrowed to dissect and understand biologi- 
cal networks? 

We'll use transcriptional regulation as a concrete ex- 
ample to illustrate these questions throughout the lecture 
notes. Analogies in neuroscience and other fields will be 
commented on. 



These notes are organized as follows: Section [T] pro- 
vides an introduction to biological information process- 
ing networks by defining the basic terminology and ap- 
proaches, and illustrating why these systems are interest- 
ing to study; Section [ll] discusses the response properties 
of single network elements, e.g. genes or neurons; Sec- 
tion |III| discusses the role, types and origins of noise in 
biological networks; Section |IV| lays the groundwork for 
information theoretic approach, defining quantities such 
as entropy, mutual- and multi-information, synergy, re- 
dundancy etc; Section [V] illustrates how tools from in- 
formation theory can be used to infer models of tran- 
scriptional regulation, i.e. transcription factor - DNA 
interaction; Section |VI| proceeds to extend the tools to 
analyze simultaneous interactions of more than pairs of 



elements in network; and finally, Section VII proposes a 



new information-theoretic principle that can perhaps ex- 
plain the design of several developmental transcriptional 
regulatory networks. 



I. BIOLOGICAL INFORMATION PROCESSING 
NETWORKS: NOISY NONLINEAR DYNAMICAL 
SYSTEMS 



The primary source for this section is Ref (Tkacik et 
|al |2009 ), which provides a more complete bibliography 
and a review of the study of biological networks. 

Let us start by describing several biological networks 
and reviewing some of their general properties: 

Transcriptional regulatory networks. In their 
genomes, organisms contain from several hundred to sev- 
eral tens of thousands of genes. The expression levels of 
these genes are primarily regulated by a set of proteins 
known as transcription factors (TFs), which are encoded 
by a few up to about 10% of genes in the genome. 
TFs bind specifically to short regulatory sequences of 
~ 10 — 20 nucleotides in length, also known as binding 
sites, on the DNA, thereby modifying the expression lev- 
els of regulated genes. TFs can cross- and self-regulate, 
opening up a possibility of feedback regulation. They 
are usually present in nuclei in small, nanomolar range 
concentrations (for a nucleus with several /^m radius, 
these concentrations correspond to several hundred to 
thousands of TF molecules per nucleus). The timescales 
of such regulation span from minutes to hours. Some 
known examples of such regulation are the Lac operon 
in Escherichia coli, the A repressor switch, many exam- 
ples in yeast, genes of early development in Drosophila 
melanogaster (such as bicoid, hunchback, even skipped), 
Hox genes etc. 

Signaling networks. "Sensory proteins," such as 
G-protcin-couplcd receptors, sense various extracellu- 
lar molecules. They respond to incident photons, like 
rhodopsin, and bind neurotransmitters and environmen- 
tal molecules to which we respond by sense of smell. Sim- 
ilarly, in bacteria, histidine kinase proteins (one part of 
two component signaling systems) are also membrane- 
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FIG. 1 A 2 component signaling system used by bacteria to 
transduce information from tlie environment into the cell. A 
receptor protein, histidme kinase or HK, is embedded into 
the membrane and able to bind ligands (shaded pentagons). 
Upon binding, HK undergoes a conformational change and 
transfers a phosphate group (P) to a specific diffusable pro- 
tein, called a response regulator (RR). An activated form of 
this protein, RR* , can act as a transcription factor and regu- 
late target genes (such as gene g in this figure). 



bound proteins that detect their specific ligands. Upon 
binding their Hgands these proteins change confirmation 
and cause a chain of phosphorylation / dephosphoryla- 
tion reactions that chemically modify their target pro- 
teins (e.g. in bacterial two component signaling systems, 
the targets are the so-called "response regulator" pro- 
teins), thus altering their activity. These proteins can be 
present in thousands per cell, and the chemical reactions 
are much faster than in the case of transcriptional regula- 
tion, with equilibration times on the order of milliseconds 
to seconds. The reaction specificity is thought to occur 
via molecular 'lock-and-key' like recognition mechanisms, 
but there exist cases of both unwanted crosstalk and in- 
tentional signal integration, where the same molecular 
targets are modified by various upstream enzymes. 

Neural networks. Neurons transmit signals by prop- 
agating stereotyped volta ge pulses, o r actio n potentials, 
across their membranes ( jRieke et al 1997). These ex- 
citations are driven by ionic currents that flow through 
special proteins embedded in the membrane, called ion 
channels. The speed of propagation along the fibre is 
on the order of meters per second, and the timing preci- 
sion of spikes can below a millisecond. Most neurons are 
all-or-nothing devices: upon receiving input from other 
neurons through its dendrites (which can number into 
tens of thousands, in the mammalian cortex), a neuron 



with some probability either produces a spike and sends 
it down its axon, or not. The axon synapses upon thou- 
sands of other neuron's dendrites. The complexity of neu- 
ral network processing stems from the fact that synapses 
are state- and history-dependent, as their transmission 
probabilities can be adjusted on long timescales by chem- 
ical modification (this is responsible for learning). Neu- 
rons too have complex internal states and exhibit many 
interesting computational capabilities, such as nonlinear 
input summation, adaptation, resonance properties, and 
different regimes of operation dependent on the precisely 
controlled ion channel composition. 

These networks share a number of common features: 

• Biological networks are dynamical systems. The 

relevant timescales are the time on which the input 
fluctuates, the timescale on which single elements 
respond (neural spiking, protein decay rates) and 
potentially the timescale on which the network it- 
self changes its properties (learning in neural net- 
works, change in signaling protein concentrations in 
signaling networks). Networks can be (self-)tuned 
to special operating points (e.g. dynamical crit- 
icality), where new, "emergent" timescales might 
appear in the system. The networks often contain 
positive or negative feedback loops. 

• The wiring in the network is specific. Speciflcity 
can be achieved by spatial organization (neural net- 
works, chromosomal organization, metazoan gene 
regulation) and selective establishment or death 
of connections (neural networks), or by molecular 
mechanisms of recognition (TF-DNA interaction, 
signaling enzymes). 

• The network dynamics is noisy. This is a con- 
sequence of the stochasticity in single molecu- 
lar events at low concentrations of the relevant 
molecules. Neuronal spikes are not completely re- 
producible even if the same stimulus is played to 
the neuron over and over again, because fluctua- 
tions in opening and closing of a flnite number of 
ion channels in the membrane are not negligible. 
The nanomolar concentrations of TPs in the cell 
mean that the precise timing when a TF finds and 
binds a regulatory site on the DNA is a random 
variable which results in stochastic gene activation. 
We discuss the importance of noise below. 

• The network elements are nonlinear. There are 
saturation effects, as in when all enzymes of the sig- 
naling pathway are operating at capacity, or a gene 
is fully activated. Neurons themselves are excitable 
systems, that either don't respond or respond fully 
with a spike. Functionally, nonlinearities enable the 
systems to make "decisions" (e.g. by thresholding) 
and to re-represent their inputs in nontrivial ways 
(e.g. not by simple linear, rotation-like transforma- 
tions). 
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A. Frameworks for describing biological networks 



Various formal levels of description have been used to 
analyze biological networks, and different approaches em- 
phasize some aspects enumerated above at the expense 
of the others: 

Topological models discuss networks in terms of 
wiring diagrams. The focus is on summarizing experi- 
mental data about the patterns of interconnection using a 
(possibly directed) graph. In genetic regulation, a partic- 
ular kind of arrow in a wiring diagram might imply that 
gene A is activating gene B, while another kind might 
imply repression. Topological models are concerned with 
statistical properties of such graphs, in particular, in how 
they differ from various simple models of random graphs. 
For instance, global statistical properties, such as node- 
degree distributions, clustering coefficients, mean path 
lengths etc have been studied for metabolic and regula- 
tory networks, and the network of interconnections of all 
302 neurons of C elegans^ . The advantage of this method 
is that reasonably complete regulatory network diagrams 
exist, and their properties can be compared to other net- 
works, including engineered ones (e.g. transport, internet 
routing etc). A parallel line of inquiry has shown that cer- 
tain local graph connectivity features, known as motifs^ 
are overrepresented in, for example, transcriptional reg- 
ulatory networks compared to randomized network en- 
sembles ( |Shen-Orr et al[ |2002[ ); see Fig. [2] One should 
bear in mind, however, that the connectivity graph is a 
drastic oversimplification of the true network: without 
knowing (i) the kinetic parameters on the arrows, (ii) 
how the regulatory arrows converging onto a single node 
combine in their effects, and (iii) what are the internal 
states of each of the nodes in the regulatory graph (e.g. 
often the same node represents both mRNA and its pro- 
tein product, both of which have their own dynamics with 
delays), one can not predict from the connectivity alone 
how the network will behave, although certain classes of 
dynamical behavior can be excluded. 

Interesting findings pertaining to biological networks 
described by connectivity graphs have been: (i) their 
scale free degree distribution, with the probability of 
a node to be connected to k neighbor s bein g P{k) ^ 
A;-T (with 7-2-3) ( [Barabasi era!} |2004[ ), and the 
consequent identification of huh nodes (often essential 
proteins); (ii) robustness to breakup of the connected 
component with respect to removal of most nodes, but 
fragility with respect to removal of hubs; (iii) "small 
world" architecture, with short mean path lengths be- 
tween pairs of nodes, and high clustering coefhcients 
(Strogatz 2001); (iv) hierarchical yet clustered nature 



(no preferred size of the cluster, consistent with scale-free 



FIG. 2 Graph representation of simple regulatory networks, 
a) Transcription factors ci and C2 regulate genes g\, g2 and 
g-i, arrows represent activation, blunted lines represent repres- 
sion, ci activates gi and g2, while C2 auto-represses itself and 
genes g2 and g^. gz laterally activates g2. b) Feed-forward 
loop, one of the motifs found to be overrepresented in the 
transcriptional regulatory network of E coli: gene A activates 
B which activates C, but A also directly activates C. 



property) ( [Ravasz et al 2002); (v) "dissociative struc- 
ture" , in which hubs are often not connected to each 
other (in contrast to social networks, where friends with 
lots of friends are often friends among each other) . 

Boolean models. Here, the network is represented 
as a collection of boolean variables {(Xi}, which can 
take "on" (cr = I) and "off" [a = 0, sometimes 
(T = —1) values, and a set of dynamical update rules, 
ai{t -\- \) = f{aj{t)), that evolve the system forward in 
time in discrete steps. The variables might be thought 
of as two-state genes (or neurons), and the update rules 
are combinatorial (Boolean) expressions involving input 
TFs on the promoters (or dendritic inputs in neurons). 
Generalizations to more than 2 states have been used, 
e.g. in modeling Drosophila developmental gene network 
(Sanchez et al 2001). This method scales well to simu- 



^ Interestingly, the full wiring diagram of all the neurons in a rel- 
atively simple animal has not really brought us closer to under- 
standing how the functions emerge in this model nervous system. 



lations of many genes and emphasizes the dynamic and 
nonlinear nature of the network, but can introduce se- 
rious artifacts due to synchronous update rule. In neu- 
roscience, the Hopfield network is of the similar form, 
with asynchronous update, and provides a clean theoret- 
ical model of associative memory that is able to recall 
a stored binary pattern from any closely matching sub- 
pattern (Hopfield 1982). In that case the states of N 
neurons are given by {cTi}, the update rule is downhill 
descent ai ^ sgn(^^- Jijcrj + hi) evaluated under the 
network wiring parameters {hi, Jij}, the input pattern is 
the initial state of the network {(Ti}o, and the final state 
is the attractor of the dynamics. The memories are K 
binary patterns /i — 1. . .K, stored into Jij by the 
prescription Ji^ = R-^ C^j'- 

Interesting applications of Boolean network modeling 
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include the description of the Drosophila gap-gene system 
in which 4 gap genes respond to 3 maternal TF inputs 
with spatially varying profiles; this model could describe 
correctly the expression patterns in a number of known 
mutants (Sanchez et al 2001[ ). Another interesting case 



involved describing the budding yeast cell cycle driven 



by a network of 11 interacting nodes (Li et al, 2004 1. 



In this network the topology induces a robust sequence 
of state transitions (growth, DNA duplication, pause, 
division), triggered by a "cell-size" checkpoint; robust- 
ness here refers to the fact that the dynamical trajectory 
through the states remains unchanged upon perturba- 
tion of update rules, small changes in topology and the 
parameters. 

Dynamical systems. In the field of dynamical sys- 
tems analysis, one usually assumes that each network 
element can have a certain degree of "activity" (expres- 
sion level for a gene, fraction of activated proteins of a 
given type in a signaling network, firing rate of a neuron) . 
These activities gi are treated as continuous variables un- 
dergoing network dynamics. As a simple example one 
could write: 



dgi 
dt 




(1) 



here, the first term describes relaxation towards equi- 
librium with timescale r, while the second "activation" 
term adds up contributions from other network elements 
weighted by a connection matrix J^j, and a possible di- 
rect input li to element i. F is a nonlinear function, 
often saturating (or sigmoid). In the case of gene regu- 
lation, gi could be the expression level of gene j, t the 
protein decay rate, the contributions of transcription 
factor j to the expression of gene i, Ii the direct exter- 
nal influence on i by e.g. induction, and F would be 
the promotor input/output function, discussed later. Of 
course more complex and realistic examples are possible, 
potentially including spatial effects due to, for example, 
diffusion of TF molecules in the cells. In the study of 
dynamical systems one focuses on the search for univer- 
sality classes of dynamical behaviors, and looks for limit 
cycles and steady states in dynamics and their sensitivity 
to network control parameters [such as in Eq ([T])]. 

Well-known applications of this approach involve mod- 
eling the chemotaxis module of Escherichia coli ( [Bray et 



al 1993 1 , a model of cell cycle control in fission yeast (ap- 



proximately 10 proteins and 30 rate constants) (Novak et 



al[ 1997), and the circadian clock in mammals (a system 
of about 20 rate equations) that exhibits autonomous 
oscillations and reproduces the entrainment of oscilla- 



tio ns thr ough light-induced gene expression (Leloup et 
[al 2003 1 . This approach is popular also for modeling of 
smaller and simpler (sometimes synthetic) systems, such 
as the repressillator ( Elowitz et al[ 2000) or the Min os- 
cillator in bacteria. 

Probabilistic models and stochastic dynamics. 
Probabilistic models try to capture the noise inherent in 



biological processes and experimental protocols. For a set 
of (for example) binary variables {ci} that denote the si- 
multaneous expression levels of genes (or firing / silence 
states of the neurons), one class of probabilistic models 
tries to find good approximations for P{{ai}\C), that is, 
the probability distribution that all genes (or neurons) 
are in a particular configuration {cr^} given the external 
conditions (or stimulus) C. This is then compared to 
experimental data, which is a sample of many measure- 
ments of {ai} across cells or time. P{{ai}\C) gives a gen- 
erative model from which synthetic data that resembles 
real measurements can be drawn. Various approaches 
differ in what forms for P are assumed. We will mention 
Bayesian network inference and maximum entropy mod- 
els in Section |VI] A second class of probabilistic models 
attempts to generalize dynamical equations, e.g. Eq ([I]). 
If the noise is not too big, the simplest way is to write: 



dgi 

dt 



F 



+iA+uty, (2) 



where (C(i)^(i')) = '^T{{gj})d{t-t') is the Langevin noise 
strength, i.e. a random force whose variance T might de- 
pend on the state of the system (braces denote averaging 
over many realizations of the noise time series). Tech- 
niques from statistical mechanics and diffusion can then 
be used not only to calculate the mean dynamics gi{t) 
of this system, but also the evolution of the "noise" or 
the fluctuation around the mean, characterized by the 
covariance matrix of all gi, Cij(t,t') — {Sgi{t)Sgj(t')). 
When concentrations of constituents are low (or when 
we are interested in single neural spikes), the continu- 
ous description with gi must break down and the state 
space is replaced by {n^}, the integer counts of individual 
molecules such as DNA or TFs. One switches to describ- 
ing the full evolution of the probability distribution using 
the Master equation (van Kampen 2007): 



dP{{ni}\t) 
dt 



= LP({nJ|t), 



(3) 



where L is some linear evolution operator. 

Finally, there are techniques like the Stochastic Simu- 



lation Algorithm (SSA) of Gillespie ( |Gillespie[ [2007[ ) and 

generalizations to spatially extended systems, where the 
discrete stochastic dynamics can exactly be simulated at 
the expense of slow execution speeds. One of the pioneer- 
ing studies with SSA was the simulation of the A lysis / 
lysogeny switch (McAdams et al 19971, where the simu- 



lation tried to reproduce the fraction of lysogenic phages 
as a function of multiplicity of infection. 



B. A simple regulatory element formulated In different 
frameworks 

Let's start by discussing the simplest possible example 
of genetic regulation; we will develop this example as we 
progress through the lectures. 



6 




FIG. 3 The simplest regulatory graph, where an input tran- 
scription factor at concentration c regulates the output ex- 
pression level g by binding to a binding site n, which can be 
empty or occupied. Since c acts as an activator, an occupied 
site results in transcription and translation of g. 



Let the transcription factors (TFs) be present at con- 
centration c in the cell. On the DNA, there is a single 
specific binding site that can be occupied or empty; we'll 
denote this occupancy with n{t). When the site is occu- 
pied, the regulated gene will get transcribed into mRNA, 
which is later translated into proteins whose count we de- 
note by g{t), at the combined rate that we denote by R. 
The proteins are degraded with the characteristic time t. 
In this case, our TF thus acts as an activator, see Fig. |3] 
Here and afterwards we will refer to the transcription 
factor c as an input, and the regulated gene product g as 
output. 

This model discards a lot of molecular complexity: 
there is no explicit treatment of diffusion of TFs, no 
non-specific binding, no separate treatment of mRNA 
and protein, no chromatin opening / closing etc; in ad- 
dition, we group many multi-stage molecular processes 
(such as TF binding, RNAP assembly, processive tran- 
scription etc) into single steps. Thus, our model is a 
gross but tractable oversimplification. As an illustration, 
let us formulate it in all of the mathematical frameworks 
discussed above. 

The topological diagram for this example is simple c — 
g. In the case of Boolean network models, the state space 
of this network is {eg} G [0,1]^, indicating that both 
the transcription factor c and the regulated gene g can 
be "on" or "off." The update rules are trivial: c{t+ 1) = 
c{t),git + l) = c{t). 

Treating now concentrations c and g as continuous, we 
can describe the same regulatory process by the set of 
differential equations: 



dn 
'dt 

dg_ 

dt 



fc_|_c(t)(l — n) — fc_n 
g + Rn. 



(4) 
(5) 



Equation Q is an equation for occupancy n, which is a 
number between and 1. Nominally, the site can only 
be fully empty or occupied, but in this approximation, 
we treat it as a continuous variable that can be inter- 
preted as a "probability of the site being bound." fc+c is 
the TF-concentration-dependent on-rate, and fc_ is the 
first-order off-rate. Often, it is assumed that there is a 
separation of time scales: the first equation for occupancy 
equilibrates much faster than r, meaning that the mean 
occupancy 



n{t) 



k+c{t) 



k+c{t) + 
can be inserted into Eq ^ to get 



dg 1 f,-, . _k+c(t) 

-^--g[t) + R^—^-^_ 



(6) 



(7) 



In this simple case without feedback, the approach to the 
equilibrium at fixed c is exponential with the rate t, and 
the steady state is simple: g = Rrn. Equation Q has 
precisely the form of Eq ([T]) with F having a signioidal 
shape. We discuss in the next section how the particular 
sigmoidal regulation functions are connected to equilib- 
rium statistical mechanics of this system, how noise can 
be added by an introduction of the Langevin force into 
Eq ([5]), and why the assumption of fast equilibration of 
n strongly influences the noise. 

Finally, if we wanted to fully capture the noise in this 
model, the object of our inquiry would be Pn{g\t, c): the 
time-dependent joint probability of observing g molecules 
of the resulting gene and the state of the binding site be- 
ing n — 0,1 (empty, occupied), given some concentration 
of the input c. One can marginalize over n to get the 
evolution of probability of observing g output molecules: 
P{g\t,c) — J2n=o 1 Pn{g\t,c). Writing down the Master 
equation and for simplicity suppressing the parameters 
(c, t) on which all terms P{g\t, c) are conditioned, we find: 



dPojg) 
dt 

dPijg) 

dt 



^Po{g + 1) + k_P,ig) - (fc+c+ ^)Fo(5) 
T r 

^Pi{g + + RPiig - 1) + k+cPo{g) - 



- {k^ + + R)P,{g); 

T 



(8) 



the reader should recognize degradation-related terms 
(proportional to l/r), the protein production terms (pre- 
fixed with R and present only in the case when the gene 
is on, i.e. n = 1) and the switching terms of the pro- 
moter containing fc+c and fc_, which couple the n = to 
n = 1 states. In this simple case, the equilibrium distri- 
bution can be solved by zeroing out the left-hand side of 
Eq (|8|. This yields an infinite dimensional system in g 
that can be truncated at some (/max ^ Rt; we would end 
up with a homogenous linear system that can be supple- 
mented by a normalization condition J2n=Q 1 '^g'=o ~ 
which can be inverted and solved for steady state Pn{g). 
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More sophisticated methods are available when the num- 



ber of genes grows and they are interacting ( Walczak et 
[al 2009). Note that in this example we treated the g as 
discrete, but c is still a continuous input parameter (not 
a variable whose distributions we are also interested in) . 

Finally, let's mention the Gillespie algorithm. For this 
algorithm we start with enumerating all reactions and 
their rates: 



c + n — > cn 
cn — > c + n 
cn cn + c, 
.9^0 



R 
-1 



(9) 



One initializes the state of the system as a vector 
(c, n, cn, g) of integer counts of molecular species (here 
cn denotes a molecular complex of a c molecule bound 
to the promoter; there can only be or 1 n and cn, and 
you can quickly convince yourself that n = 1 — cn) . Then 
the probability per unit time of each of the 4 reactions is 
the product of the rate constant and the number of reac- 
tants properly normalized by the relevant volume. The 
algorithm randomly draws the next reaction consistent 
with the probabilities per unit time, updates the state 
of the system and repeats. This algorithm is exact for 
well-mixed systems, but (i) can be slow in case there 
are fast and slow reactions in the system; (ii) one needs 
to sample many simulation runs to accumulate the noise 
statistics. 

From the presented example it is clear that the fully 
stochastic dynamical description can be relatively com- 
plicated even for a very simple system. To proceed and 
be able to connect to data, we will in these lectures drop 
the time dependence and only focus on the steady state, 
while emphasizing the nonlinear and noisy nature of the 
system. Sections [ll] and |III| therefore use the Langevin 
description to treat noise in a dynamical system, and Sec- 



tion VI presents a probabilistic maximum entropy model 
of an interacting network. Our assumption to only study 
the steady state will preclude us from discussing network 
phenomena that are intrinsically dynamic, e.g. the cell 
cycle clock, the circadian clock, or the detailed nature 
of excitability in neurons and Dictystelium cell cultures. 
But for many biologically realistic cases, such as in de- 
velopmental biology, or in many experimental settings, 
such as measuring the gene response to constant levels of 
inducer, the steady state approach is useful. 



II. NETWORK BUILDING BLOCKS AS INPUT / 
OUTPUT DEVICES 

In this lecture we will first explore simple thermody- 
namical models of gene regulation, by studying how the 
concentration of a transcription factor relates to promo- 
tor occupancy and thus to the expression level of the reg- 
ulated gene. We will then look at one possible model for 
combinatorial regulation and introduce the notion of the 



input /output relation. The section finishes by briefly sur- 
veying the experimental measurements of input/output 
relations. 



A. Simple models of regulation: Hill functions 

Let's first revisit our simple description of gene regu- 
lation from Eqs (4[5). We found that in steady state the 
occupancy of the promoter is n(c) = k+c/{k+c + fc_), 
where fc+ and fc_ are on- and off rates, respectively. 

Since this is an equilibrium system, we can ask for the 
equivalent statistical mechanics description. Suppose we 
have a site n that can be occupied or full. In case it is 
occupied, there is a binding energy E favoring the oc- 
cupied state, relative to the reference energy in the 
unbound state. But in order to occupy the state, one 
needs to remove one molecule of TF from the solution. 
The chemical potential of TFs, or the free energy cost 
of removing a single molecule of TF from the solution, 
is = fc^Tlogc, where c is the TF concentration, mea- 
sured in some dimensionless units of choice. In statisti- 
cal physics we can calculate every equilibrium property 
of the system if we know how to compute the partition 
sum, which is Z ~ '^^^-PiEi-tmi) ^ where the sum is 
taken over all possible states of the system (in our case 
binding site empty and binding site occupied), Ei is the 
energy of the system is the state i, and n^ is the number 
of molecules in the system in the state i. 

In our case of a single binding site, the partition sum 
is over the empty (n = 0) and occupied (n = 1) state: 



-/3(iS-M) 



1, 



(10) 



where /3 — \/{kBT), T is the temperature in Kelvin and 
ks is the Boltzmann constant. The probability that the 
site is occupied is then 



P{n = 1) = le-^(^-^) 
Inserting the definition of /i, we get 
P(n = 1) = 



c + Kd' 



(11) 



(12) 



where we write Kd = exp(/3i?). But n = 1 • P(n = 
1) -I- • P{n = 0) = P(n = 1), so by comparing with Eq 
Q we can make the identification 



Kd 



k. 

kl 



(13) 



which connects our statistical mechanics and dynamical 
pictures. Note that fc_ is measured in units of inverse 
time, s^^, k^ is measured in units of x [conc]~^ 
(but by convention we here measure concentration in di- 
mensionless units, as in /i = ksTlogc), so Kd has units 
of concentration. 

Suppose we make the model somewhat more compli- 
cated: let us have two binding sites, which together will 
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constitute a system with 4 possible states of occupancy: 
both sites empty, either one occupied, and both occupied, 
which we'll write compactly as (00, 01, 10, 11). Let's also 
assume that there is cooperativity in the system - if both 
sites are occupied, then there will be an additional favor- 
able energetic contribution of e to the total energy of the 
state (11). Finally, when promoters can have multiple 
internal states, we need to decide which state is the "ac- 
tive" state, when the gene is being transcribed^; here we 
pick the state (11) as the active state. 
The probability of being active is then 



P(ll) 



-2E-£+2fi 



(14) 



where we use the units where /3 — 1, that is, we express 
the energies and chemical potential in thermal units of 
fcsT. If the cooperativity is strong, i.e. the additional 
gain in energy e is larger than the favorable energy of 
putting a molecule of TF out of the solution onto the 
binding site, e <C /J. — i?, we can drop the middle term of 
the denominator in Eq (14) and simplify it into: 



P(ll) 



(15) 



with Kd = exp[/3{E+e/2)], where again we have used the 
definition of chemical potential ^. This problem with 2 
binding sites and 4 states of occupancy also has a comple- 
mentary dynamical picture, which is already quite com- 
plicated, see Fig. |4j 

Readers used to molecular biology models of ge ne reg - 
ulation will recognize sigmoidal functions in Eqs ( 4|15 1, 
also known as Hill functions, with a general form (see 
Fig.§: 



n(c) 



(16) 



where the dissociation constant is interpreted as the 
concentration at which the promoter is half induced, and 
h is known as the cooperativity or Hill coefficient, usually 
interpreted as the "number of binding sites" Here we 
have shown how such phenomenological curves arise from 
simple statistical mechanics models of gene regulation 
with cooperative interactions. For repressors, one can 
show that n{c) ^ Kj^/ic'' + K^). 

Before proceeding, let's inspect more closely the rela- 
tion between the dynamical rates and the binding en- 
ergy for a single site: k-/k+ = exp(/3i?). As we have 



In general, each internal promoter state could have its own tran- 
scription rate, but often one state is picked as having the max- 
imal transcription rate, and the other states represent the gene 
being "off" or expressing at some small, "leaky" rate of gene 
expression. 

In case where there is self-activation of the gene, i.e. gene g can 
activate its own transcription in addition to being activated by 
the input c, that conclusion is incorrect. 




FIG. 4 The transitions in the model with 2 binding sites 
and 4 occupancy states, (00,01,10,11). The binding of an 
additional molecule of TF happens at a rate fe+c, whereas 
unbinding rates are state dependent: a singly occupied pro- 
moter returns to the non-occupied state with a rate fc_, but 
the doubly occupied promoter loses a molecule of TF with the 
rate k_. This difference is due to cooperativity, where the 
binding of one molecule stabilizes the binding of the other, 
and this makes the unbinding rates state dependent. The 
"active" state is (11) in the lower right corner. We leave 
it as an exercise for the reader to write down the dynami- 
cal equations dnoo/dt — ■ ■ ., driQi/dt = . . . etc, observe that 
noo + noi + nio -I- nii = 1, and compute the steady state ac- 
tivation if cooperativity is strong, nn. As in the case of a 
single binding site, this expression can be connected to the 
thermodynamic result of Eq ( 14 1 . 



shown in Eq (131, this equality is required by detailed 
balance if thermodynamic and kinetic pictures are to 
match. Molecularly - and we will devote the entire lec- 
ture of Section |V] to this - the energy of binding E in the 
case of transcription factor - DNA interaction depends 
on the sequence. So if we were to vary the sequence 
and binding energy E would change, which of the two 
rates, fc_ or fc+ would vary as a result? In general one 
cannot answer this question without knowing in detail 
the sequence of molecular transitions that happen at the 
binding site. However, there is a useful limit, called the 
diffusion-limited on rate, that is often applicable. In this 
regime, the limit to how quickly the TF can bind is given 
by the speed at which it can diffuse to the binding site. 
It has been shown that if TF diffuses with diffusion con- 
stant D and is trying to bind a site with linear dimension 
a, the fastest on rate is k - u ^ AT iDa, for spherical TF and 
binding site'* (Berg et al 1985). In the diffusion-limited 



approach, if the binding site is empty, as soon as a TF 
diffuses into a region of size a around the binding site, it 



* If assumptions about geometry are relaxed, the prefactor 47r will 
change. 
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FIG. 5 Three Hill regulatory functions with different slopes 
(Hill coefficients h), as in the legend. All functions have Kd = 
1. Input TF concentration is customarily plotted on logarith- 
mic horizontal axis, while the average promoter occupancy n 
is on the vertical axis. The output gene expression is in steady 
state g = (J?T)n(c), i.e. proportional to occupancy. The slope 
of n(c) on the log-log plot at half-induction (c = Kd) is related 
to the Hill coefficient, d(logn)/ci(logc)|/Q = h/2. 



will immediately bind. Then, all dependence on binding 
energy E will be absorbed into the off rate k_ . Intuitively 
we can understand this by imagining that once the TF is 
bound in an energetically favorable configuration, it has 
to wait for a random thermal kick of typical size fcsT 
to unbind, and the probability of that kick being able to 
overcome the binding energy barrier £^ is ^ exp{E /ksT) 
(remember, the more negative the binding energy, the 
stronger the binding). 



B. Phenomenological models of regulation 

In the previous chapter we have shown how thermo- 
dynamic and kinetic models are connected for simple 
cases of gene regulation where a single transcription fac- 
tor binds cooperatively to different numbers of binding 
sites. In many cases, however, several transcription fac- 
tors together regulate a single gene. How can such situ- 
ations be addressed from a theoretical perspective? 

Here we briefly discuss three possibilities, their draw- 
backs and benefits. 

First, if one knows the detailed molecular map of the 
states and their transitions, it is possible to write down 
the diagram as in Fig. |4] and compute the steady-state 
occupancies; alternatively, this can be done in the ther- 
modynamic picture if one knows the binding energies and 



cooperativities. For simple cases, such as a single acti- 
vator and repressor jointly controlling a gene and hav- 
ing an energetic interaction, this is feasible and has been 
done for, e.g., the lac operon in E coli, as we discuss 
later. As we advance towards more and more complex 
organisms, however, this approach loses its appeal: for 
metazoan enhancers, for instance, we don't even know 
the microscopic states (much less their energies), and so 
cannot write down the partition function. While correct, 
this approach does not scale to more complex regulatory 
strategies well. 

Second, we can decide for an ad hoc approach. Here, 
we write down a model probability that the gene is on 
as a function of its TF concentrations, but do not worry 
whether this probability is actually derivable from some 
statistical mechanical system, or alternatively, if there 
is a realistic dynamical scheme that would generate this 
model probability. Often, such approaches borrow the 
intuition from simple thermodynamics results and com- 
bine them into a more complicated regulatory scheme. 
For example, if the gene g is activated by TF A, present 
at concentration ca, and repressed by TF i?, present at 
concentration cs, one could postulate (without deriving) 
that the occupancy of the promoter is 



'n{cA,CB) 



(17) 



This expression assumes that molecules of A bind in- 
dependently to hA sites with dissociation constant Ka, 
and molecules of B bind to hg sites with dissociation 
constant Kb] importantly, we also assume that the joint 
regulation is and-like, meaning that gene g will only be 
activated when both A is bound and B is not bound 



[that's why there is a product in Eq (17)]. More complex 
schemes like this can clearly be derived, and while they 
will not necessarily correspond to any possible thermo- 
dynamic system, they might be useful phenomenological 
models that can be fit to the data. 

Third, we can pick a real thermodynamic model that 
is flexible enough to encompass many possible combina- 
torial strategies of gene regulation but will have a small 
enough number of parameters to connect to available 
data. As in the previous case, this model might not corre- 
spond on a molecular level to the events on the promoter, 
and would thus also qualify as a phenomenological model. 
It would, however, have the advantage of being more eas- 
ily interpretable and understandable within the context 
of statistical physics. One such model is the so-called 
Monod-Wyman-Changeaux model, which we discuss be- 
low. 



C. A model of combinatorial regulation: MWC model 

In this section we'll discuss a thermodynamic model 
that can easily be extended to include combinatorial reg- 
ulation. The model has been motivated by the work on 
allosteric transitions and was used to explain hemoglobin 
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FIG. 6 A schematic diagram of MWC model. Two possi- 
ble states of the promoter, "on" and "off", are separated by 
an energy barrier of L. There are 3 binding sites for the 
transcription factor in this example, to which TFs bind in- 
dependently; their binding energy, however, depends on the 
state of the promoter. Here, 2 of the 3 sites are occupied, and 
are contributing E^^^^ — /i each to the total free energy. If the 
promoter is "off," there is no transcription, if it is "on," tran- 
scription proceeds at rate R and gives rise to Rt molecules of 
output in steady state at full induction. 



portional to the expression of the gene): 



F(on) = 



(1 



(l + e--Ei+^)" + (l + e--EA+M)r 



1 \n 



{l + c/K\)^ + L{l + c/Kl)'-' 



(21) 



(22) 



Equation ( 22 ) is written in the standard form, with the 
identifications K\ = exp{(3E\), K\ = exp(/3i?^) and 
L = cxp(L). 

A regulatory impact of transcription factor A onto the 
regulated gene is described by quantities {n,K'^,K\) 
in the MWC model. There is one additional param- 
eter L, the offset (or "leak") favoring the "off" state. 
Note that the parameters Kj^ of the MWC model are 
not directly comparable to Hill model parameter Kd; 
however, we can make the identification in the regime 
where c/K'^ < 1 and c/K\ > 1 . Then the term 
(1 -I- c/ifj^)" in Eq (22 1 can be approximated with 1, and 
(1 + c/K\)'' « {c/K^'\ Equation ^ then reduces to 



function (Monod et al 1965). When applied to the case 



of gene regulation, the central idea is that as a whole, 
the promoter can be in two states, "on" (1) and "off" 
(0). Remember that in our previous examples we had to 
declare one of the combinatorial states as the "active" 
state; here, this distinction is built into the model by 
assumption. 

The regulatory region has ua binding sites for tran- 
scription factor A. These sites can be bound in both the 
active and inactive state, and molecules of A always bind 
independently, see Fig. [6j However, the binding energy 
for each molecule of A to its binding site is state depen- 
dent, i.e. E\ when the whole promoter is "off" vs E\ 
when it is "on." Let's work out the thermodynamics of 
this system. For each of the two states, we can write 
down the free energies of k molecules of type A bound: 



Fi = k{E\~fi), 



(18) 
(19) 



where fi — log c (we are writing everything in units of 
kgT and dimensionless concentration again), and L is 
how favoured the "off" state is against "on" state even 
with no TF molecules bound. The partition function is 
then 



fc=0 



fc=0 



P(on) 



c" + L{K\y 



(23) 



and we can identify the parameter n in the MWC model 
with the Hill coefficient /i, and the dissociation constant 
of the HiU model, Kd, with Kd = L'^^'^K^. 

In general, for a single gene, the MWC model is not 
much different from Hill, producing sigmoidal curves that 
don't necessarily cover the whole range from to 1 in in- 
duction as the input changes over a wide range. However, 
in the limit where c/K'^ <C 1, we can easily generalize 
MWC to regulation by several transcription factors. To 
see how, rewrite Eq ([2T|) as 



P(on) 



1 

1 + eP(^) ' 



(24) 



where F{c) — — nlog(l -I- c/K\) + L. In this picture, 
the binding and unbinding of transcription factors simply 
shifts the free energy of "on" vs "off" state. We can 
easily see that if K transcription factors /i = A,B,... 
with concentrations regulate the expression of a gene. 



we can retain Eq (24), but write 



F{{c,}) = 



, log 1 



K„ 



(25) 



it is easy to check that positive represent activating 
influences, while flipping the sign of /i makes that gene fi 
repress the expression of g (Walczak et al 2010). 



Recognizing that the sums are simply binomial expan- 
sions^, we get for the probability of the "on" state (pro- 



D. Input/output relations and surfaces 



Note that T,k=o {T)^^ = (1 + 



These considerations lead to a useful abstraction that 
phenomenologically describes the behavior of single ge- 
netic regulatory elements, and summarizes how the input 
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logio(cB) 



FIG. 7 A regulatory surface for a gene with two inputs, an 
activator A present at concentration ca and a repressor B 
with concentration cb- The model is MWC, with L — 5, 
ua = 2, ub ~ —1 (negative since it is a repressor), Ka ~ 0.5 
and Kb = 0.7. 



TF levels get integrated on the promoter into the activ- 
ity of the regulated gene. We need to plot the curve 
(or surface in case of combinatorial regulation) of gene 
activity as a function of all relevant input TF concen- 
trations, g — g{{c^}): this is a nonlinear mapping that 
combines all inputs into a given output. In simple organ- 
isms (prokaryotes), the molecular understanding of these 
relations is probably close to correct: the promoter ac- 
tivity is just the occupancy of the promoter by the RNA 
polymerase and thus is proportional to the rate of mak- 
ing new transcripts. For higher organisms, the molecular 
picture is likely incorrect. Nevertheless, input/output 
surfaces are useful abstractions for thinking about regu- 
lation, and concrete models such as MWC can compactly 
summarize experimental data. 



E. Experiments 

Are there quantitiative measurements of input/output 
relations in gene regulation? 

For systems with a single input and a single output, 
it is commonplace to measure such relations and charac- 
terize them with Hill-like functions. One example of a 
single input, single output function is shown in Fig. |8] 
This relation has been obtained in quantitative measure- 
ments using immunostaining methods and microscopy 
in the fruit fly embryos (Gregor et al 2007 Tkacik et 




-0.5 

ln(Bcd/Bcd^/2) 



FIG. 8 The input/output relation between 2 genes involved in 
the early embryonic development in Drosophila melanogaster 
( [Gregor et al] pOOT] [Tkacik et al| |2008| ). On the horizon- 
tal axis, the concentration of the input transcription factor 
called bicoid, on the vertical axis the expression level of the 
target of bicoid regulation, called hunchback. The data has 
been normalized so that at full mean induction hunchback 
reaches the value 1. Bicoid is expressed in units of its disso- 
ciation constant denoted by Bcdi/2: when the concentration 
of bicoid is Bcdi/2, hunchback is expressed at half-maximal 
levels. Dashed curves show the measurements in different 
embryos, and the error bars show the noise in hunchback ex- 
pression that we'll study in the next lecture. As we will see 
later, development is a very convenient system for extract- 
ing input/output relations because the physiologically rele- 
vant range of inputs is naturally laid down in the form of a 
spatial gradient across the embryo, and the output hunch- 
back level for each input bicoid concentration can be read out 
simultaneously under the microscope in a single field of view. 



|ai} [2OO8I ); the de gree of reproducibility is evident from 
the match between measurements on multiple embryos. 
This curve can be fit very well by a Hill-like activation, 
Eq (16), with the fitted Hill coefficient of 5. The slight 



dip at high induction is a consequence of the fact that 
in the embryo, hunchback (the output) is regulated by 
other transcription factors, not only the input (bicoid)^. 
The situation where the factors other than the explicitly 
measured input influence the expression level of the tar- 
get gene is common in higher organisms - the list of all 
inputs is not necessarily even known, and the extracted 
input/output relations are of necessity phcnomenologi- 



® Hunchback is also thought to self-activate in addition to being 
activated by bicoid. 
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cal, reflecting the direct influence of the observed input 
regulator, but also the indirect influences through unob- 
served intermediaries. 

Perhaps the best worked out example is that of the lac 
operon in Escheri chia coli ([ Jacob et al [1961 ), see Fig[9[ 
The work of Refs ( [Setty et al , 2003; Kuhlman et al[[2007| 
has constructed a thermodynamic model of joint regula- 
tion of lac by its two transcription factors, CRP (acti- 
vator) and LacR (repressor). In a series of experiments 
the concentrations of IPTG (an inducer for LacR) and 
cAMP (an inducer for CRP) has been varied and the 
2D input /output surface has been mapped out. Mak- 
ing this surface consistent with known molecular facts 
about the regulatory proteins (such as the cooperativity 
of IPTG-LacR interaction) has required an in depth un- 
derstanding of the system, including the thermodynamics 
of DNA looping and finding other mechanisms that in- 
fluenced the intra-ccUular concentrations of cAMP (such 
as an enzyme called PDE that degrades cAMP). This 
body of work demonstrated how thermodynamical mod- 
els of regulation, information about relevant molecular 
properties of the regulatory proteins, and quantitative 
experiments can generate real understanding in a (sim- 
ple) biological system. On the other hand, the same work 
highlighted practical problems connected with inferring 
the input/output relations when input is externally ex- 
perimentally controlled: part of the mystery of making 
the models consistent with the measurements was the 
discrepancy between externally delivered concentrations 
of the inducer, and its actual intracellular concentration 
{E coli possesses alternative mechanisms that affect these 
concentrations). In the end, understanding even a sim- 
ple bacterial system such as the lac operon proved quite 
challenging. In the case of Drosophila hunchback / bi- 
coid regulation, similar problems can be avoided, because 
the input (bicoid) gradient is established naturally dur- 
ing early morphogenesis, and no experimental (possibly 
physiologically irrelevant) interference with the system is 
needed. 

Finally, in metazoan regulation our knowledge is much 
more limited. In enhancer regions that control the ex- 
pression levels of genes possibly far away on the DNA, 
transcription factor binding sites form clusters and com- 
binatorial regulation is abundant. Several kinds of TPs 
might bind, each to possibly more than one specific bind- 
ing site in the enhancer. These binding sites surprisingly 
appear to be less specific (i.e. have shorter recognition 
sequence lengths) than in bacteria. Genetic studies have 
shown that in constructs in which some of these sites have 
been permuted, the biological function is retained, while 
other tweaks disrupt the function; it is not clear what 
is the appropriate "grammar" that separates viable from 
non-functioning binding site configurations. Attempts 
are being made, however, to both fit simple thermody- 
namical models of regulation to the data in a predictive 



The lac Operon and its Control Elements 



fashion (Schroeder et al 2004 Segal et al 2008), and 
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FIG. 9 A schematic diagram of the lac operon in Escherichia 
coli. Genes for utilizing lactose are regulated jointly by the 
lac repressor (lad gene and the associated lacR protein), and 
by the CAP (also known as CRP) activator. The bacterium 
expresses lactose genes strongly only when lactose is available, 
and the preferred sugar, glucose, is not. When glucose is also 
available, induction is low, but when lactose is unavailable, 
the genes in the operon are completely shut off. Image from 
Wikimedia commons. 



gap genes respond to spatially varying concentrations of 
the morphogens (maternally deposited transcription fac- 
tors). These studies have revealed a strong role for ge- 
netic cross-repression among the gap genes, as well as 
spatial effects in positioning of the binding sites that 
might be due to the packing and regulatory role of chro- 
matin that can make the genes (in) accessible for expres- 
sion. 



F. Relation to neuroscience 



Neurons, especially in the sensory periphery, can also 



to map out the input/output surfaces of genes involved 



in early fruit fly patterning (Fakhouri et al 2010), when 



be viewed as input /output devices (Rieke et al 1997). In 
the retina, the so-called retinal ganglion cells are sensi- 
tive to features in the visual space: each neuron observes 
a small visual angle and fires when the spatio-temporal 
pattern of light in that visual angle matches the feature 
that the neuron is looking for. The output of the neuron 
can be viewed as a scalar, the probability of spiking r. 
In a typical experimental paradigm, visual stimuli can be 
precisely repeated many times and played to the neuron, 
so the probability of firing at each point in time, r{t), 
is accessible as empirical probability, or firing rate, com- 
puted across many aligned stimulus presentation repeats. 

The input characterization is more problematic. In 
principle, one projects a movie, that is, a set of image 
frames oi N x N pixels each, refreshed every mil- 
liseconds, onto the neuron. Even if the neuron is only 
locally sensitive in time, i.e. its spiking at the cur- 
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rent moment only depends on T previous movie frames, 
the dimensionality of the input space is huge. There- 
fore, naively charting out an input/output relation, r = 
r{N X N X T parameters), is infeasible. 

It turns out, however, that the features that neurons 
look for in this huge dimensional input space are of- 
ten simple. The neuron's sensitivity can be described 
by a spatio-temporal linear filter C, local in space and 
time, also known as the receptive field. A good model 
of the typical sensory neuron is that the neuron first 
takes a convolution of the filter with the movie, s{t) — 
J dt'C{t')*movie{t — t'). The resultings(i) is now a scalar 
that gets mapped through the point-wise (instantaneous) 
input /output relation into the spiking rate r(t). 

This is the traditional phenomenological model of a 
neural response, called the LN (linear-nonlinear) model. 
The linear part describes "feature extraction" and re- 
duces a stimulus of high-dimensionality to a single scalar 
projection. The nonlinear part maps that projection 
through a nonlinearity to generate probability of spik- 
ing, r(t) — r{s{t)). Some neurons are sensitive to more 
than one feature, and thus have, e.g., two linear filters 
£i, C2, generating two projections si, §2 at each instant 
in time. The input/output function is then a surface, or 
a 2D nonlinearity in neuroscience jargon, r = r(si, S2). A 
large effort in neuroscience is expended on deriving meth- 
ods of experimentally probing, and inferring the linear 
filters and non-linearities. As an additional complica- 
tion, the neural behavior is adaptive, meaning that the 
properties of both filters and non-linearities can be dy- 
namically tuned on slow timescales to reflect the changes 
in the statistics of neural inputs (e.g. the change in the 
movie properties, such as average luminosity or contrast). 



III. THE INPUT / OUTPUT DEVICES ARE NOISY 
A. Encoders, decoders, and noise 

Information flow, in biological as well as engineered 
networks, is limited due to noise. To gain intuition about 
the influence of noise, we consider an information trans- 
mission system as a "black box" that takes some input 
signal c, transforms (encodes) it and transmits it. The 
output signal g is delivered to a readout device - a de- 
coder - that then tries to determine which input was 
sent. If there were no noise, each input would be uniquely 
mapped to some output, and this mapping would be fully 
specified by a one-to-one input /output relation g — g{c). 

When noise is present, there is no such one-to-one map 
in general, and we must take into account the possibility 
that given an input symbol c, the system output is not 
uniquely determined. Instead, there exists a distribution 
over g, P{g\c), that tells us how likely we are to receive 
a particular g at the output if the symbol c was trans- 
mitted. This distribution is also known as the encoding 
distribution. A listener at the output could then use the 



FIG. 10 a) The schematic diagram of the channel receiving 
inputs c and generating outputs g according to a probabilistic 
map P[g\c). The inputs are drawn from P(c); the distribution 
of outputs is then fully determined, P{g) — P{g\c)P{c). 
b) A simple example of the binary channel. The input and 
output values can be or 1. If is sent as an input, with 
probability 1 — ei the channel transmits the without a mis- 
take, but with probability ei the channel corrupts into a 1. 
Similarly, 62 gives the probability of the channel mistakingly 
transferring a 1 into a 0. 



B ayes' rule: 



P{c\g) 



P{9\c)P{c) 

Ee^(.9|c)P(c) 



(26) 



to construct the decoding distribution, that is, an inverse 
mapping for the likelihood of each input signal c, given 
that g was received. Two things are worthy of note: (i) 
the decoding party needs to know P{c), the distribution 
of inputs that are being sent (for instance, if c are the 
letters of the English alphabet, one needs to know the 
letter frequencies in the written language to decode opti- 
mally); (ii) P{c\g) is not the final decoding result - the 
decoding party does not want a distribution over possible 
c that were sent, but instead want the specific c that was 
"most likely" sent. Decoding thus requires an additional 
rule for choosing the best guess of c from P{c\g), and 
there are various optimal choices for this rule, depending 
on how and which errors in decoding are penalized. For 
example, if c were continuous, we might want to choose 
a decoding rule that picks the best guess c out of the 
decoding distribution P{c\g), such that the estimated L2 
norm between the true transmitted c and dec oded c is 
minimized, c(g) — argmin Jdc{c- cfP{c\g) (|Rieke et 



|ai| [l997 ). 



We return to an in-depth discussion of probabilistic 
encoding, decoding, and information transmission in Sec- 
tion |IV[ To proceed, we note that there is a useful sim- 
plification at hand in the case when the symbols c and 
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g are continuous: suppose that given c, the responses g 
are weU clustered around some mean response, g{c), but 
there is also some "spread" around the mean, charac- 
terized by the variance o'g{c). The connection between 
these two quantities and the fully probabilistic picture is 
as follows: 

g{c) = J dggP{g\c), (27) 
a2(c) = J dgig-gfPiglc). (28) 

These two functions are known as conditional mean and 
conditional variance, and they can easily be extracted 
from the distribution P{g\c), if it is known. A noise-free 
deterministic limit is recovered as o'g(c) — > 0, in which 
case P{g\c) tends to a Dirac-delta distribution, P{g\c) = 

s{9-m)- 

Unfortunately, the full conditional distribution of re- 
sponses given the inputs, P{g\c), is usually only available 
in theoretical calculations or simulations, since in reality 
we rarely have enough data to sample it. In the case of 
gene regulation, sampling would involve changing the in- 
put concentration of TF, c, and for each input concentra- 
tion, measuring the full distribution of expression levels 
g. More often than not we only have enough samples to 
measure a few moments of the conditional output dis- 
tribution, perhaps the conditional mean and conditional 
variance. Given these measurements and P{g\c) that is 
experimentally inaccessible directly by sampling, we can 
try making the approximation 

P{g\c)^g{g;g{c),al{c)), (29) 

that is, we assume that P{g\c) is a Gaussian, with some 
input-dependent mean and variance. 

In the presented setting, the mean input /output re- 
sponse and the noise in the response cleanly separate: 
one is given by the conditional mean, and the other by 
conditional variance. The noise can be thought of as 
the fluctuations in the output variable while the input is 
held fixed. Recall that we are discussing all information 
processing systems in equilibrium, that is, when the dy- 
namics in g has reached steady state (and all variation 
in g at given c is due to noise). 

With the conditional mean and variance in hand, we 
can now create a detailed characterization of a noisy in- 
put/output regulatory element by means of two func- 
tions, as shown in Fig. [Tl] 

B. Sources of noise 
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FIG. 11 The input/output relation, showing the mean re- 
sponse as well as noise. The input c on the horizontal axis is 
mapped into the output g on the right vertical axis through 
the input/output relation (thick black line); this summarizes 
the mean response, g{c). For each c, there is a distribution 
of possible values of g, shown in the inset for c/Kd = 1. 
The width of that distribution, o-g(c), is the measure of 
input-dependent noise, and is plotted as errorbars on the in- 
put/output relation. The errorbars themselves can depend on 
the input, as shown by the gray line and the corresponding 
scale on the left vertical axis. 



level of noise and available bandwidth in the transmission 
lines. Such noise in electronic devices is well understood. 
Two fundamental sources of noise in electronic equipment 
that contribute are the Johnson noise and the shot noise. 
Johnson noise is due to random thermal fluctuations that 
jiggle the charges and thus induce random fluctuations in 
voltage (or current flowing through the resistance). 

Shot noise is due to quantal nature of charge carriers. 
"Current" is a macroscopic (average) quantity, and is a 
result of an integer number n of elementary charges q, 
flowing through the cable in a time interval T. We can 
write the current as the total charge flowing during time 
T, i.e. / — nq/T. If we repeated the measurement of du- 
ration T several times and were able to count individual 
charges, we would see that on average., n charges flow, 
but this number has a Poisson fluctuations around the 
mean across our repeats of the experiment. A charac- 
teristic of a Poisson random process^ is that the mean is 



What factors contribute to the noise in the response? 

Let us start by briefly considering a purely physical 
system first. In the times of analog modems and noisy 
telephone lines, the modems' information rates increased 
but started to saturate at about 30kbit/s. This is close 
to the theoretical limit predicted by Shannon's informa- 



tion theory (that we introduce in Section IV), given the 



Suppose that discrete point events (each of infinitely short du- 
ration) occur independently, and that on average we observe n 
such events in a time window T. The probability of observing 
n events in each instance of the time window T is then Poisson 
distributed, i.e. 

P{n\n) = —e-"n". (30) 



15 



equal to the variance, that is 



(31) 



We can then compute the observed fluctuations in the 
current: cr| = af^q^ /T"^ ~ Iq/T. In an experiment in 
which a constant current / is flowing, and we measure for 
time T, (t| is the variance in the current that we would 
observe across the repeats of the experiment simply due 
to the fact that the charge is composed of elementary 
units q which are not infinitely divisible. 

Sources of noise that are directly analogous to the 
Johnson noise and the shot noise also act on the molecu- 
lar level and can be observed in a wide variety of settings 
in biology. Before continuing, consider two additional 
interesting examples. 

In human vision, the photoreceptors in our retinas 
are very efficient at responding to small photon fluxes at 
low ambient light levels, when the retina is dark adapted. 
In an interesting set of experiments, various groups have 
delivered flashes of light of mean intensity / to human 
observers that needed to report whether they saw the 
flash or not. This procedure allowed the experimenters 
to trace out the psychometric response curve, with mean 
intensity J on x axis, and the probability of detection on y 
axis, summarizing the limits to human vision. This curve 
is not a step function, but rather has a smooth transi- 
tion region, and for a while it was hypothesized that this 
"fuzziness" in transition might be due to the processing 
downstream from the retina: after all, the signals need to 
travel into the brain, where we take a conscious decision 
that might be corrupted by noise, because, for example, 
our attention during the experiment faltered, or for many 
other possible reasons. 

Retinal processes that underlie vision are able to detect 
to single photons: a photon hitting retinal (the pigment 
in rhodopsin molecules) can cause a chain of chemical re- 
actions with high gain that ends up delivering a pulse at 
the photoreceptor output, and that pulse subsequently 
gets reported to the central brain in the spike trains of 
retinal ganglion cells. The fuzziness in the transition re- 
gion that was observed in the experiments was not due 
to imperfect circuitry of the retina or the brain; rather, 
when delivering light pulses with intensity /, the experi- 
menters were really delivering NztV^ photons (variance 
due to Poisson shot noise) in a given interval of time®. 
The smooth rise in the psychometric curve was due to 
this spread in the number of photons actually delivered 
- humans really respond reliably to as few as 6 photons, 
and at those low light levels, the fractional fluctuation in 



The mean number n, or alternatively the rate r = n/T, are 
sufficient statistics for Poisson processes. 

The number of photons emitted is related to the wavelength of 
the light and the duration of the pulse. The number of photons 
incident on the photoreceptors in addition depends on the geom- 
etry of the eye and the light source, and possible scattering and 
absorption of light in the intervening medium. 



the number of emitted photons at each pulse is significant 
(i.e. ~ I/a/TV). Remember that the inability to deliver a 
precise number N of photons is not due to experimenters' 
lack of attention to detail: the emission of photons is a 
quantum probabilistic process and the Poisson shot noise 
is a basic physical limit that cannot be circumvented by 
a classical light source. 

In addition to shot noise, we can also find the analogs of 
thermal (Johnson) noise in early vision. Due to thermal 
fluctuations, there is a small chance that the molecule of 
retinal will undergo a spontaneous conformational transi- 
tion exactly mimicking the one that would normally have 
been caused by an impinging photon. There is no way for 
the downstream neural circuitry to distinguish whether 
such an event was a "real," photon-induced transition, 
or a thermal fluctuation. Both in our eyes and in CCD 
cameras, this so-called "dark current" acts as a constant 
background hash causing false positi ves even if the de- 
tection circuit were otherwise perfect (Bialek 2002). 



In bacterial chemotaxis, bacteria like Escherichia 
coli implement a well-studied strategy of swims and tum- 
bles, that is, periods of swimming in straight lines when 
the flagella coherently bundle in a cork-screw-like pro- 
peller, and periods when flagella turn incoherently, caus- 
ing the bacterium to randomize its direction. The bac- 
teria also like to swim towards sources of molecules that 
they find useful ( "chemoattractants" ) , and they achieve 
this by modulating the frequency of random tumbles: as 
long as the concentration of chemoattractants that the 
bacterium senses is increasing with time, the tumbling is 
repressed (since the swimming is in the correct direction) ; 
if the concentration is decreasing, tumbling is enhanced. 

Often, these chemoattracting chemicals will be present 
at very low concentrations, and one can ask the ques- 
tion "how well can bacteria, even in principle, tell along 
which direction in space the concentration is increasing" ? 
If the chemoattractant c were at high enough concentra- 
tion, one could imagine the bacterium having a detector 
for c in its front and its back, which would allow it to 
measure the gradient of c by simply measuring the front- 
to-back difference. Operating at very low c concentra- 
tions, however, we need to stop thinking about continu- 
ous and infinitely precise concentration fields and start 
thinking about single ligand molecules. As this hypothet- 
ical bacterium "sniffs" with its front detector, it might 
measure, in time T, iVfront molecules on average at the 
front, and iVback molecules in the back. But the molecules 
are discrete entities, so each of these measurements will 
fluctuate around the mean by V Affront and \/ iVback from 
measurement to measurement. Taking a difference be- 
tween the front and the back counts, in the presence of 
this noise (and with no temporal averaging), is a very 
bad strategy for inferring the direction of the chemoat- 
tractant source. If A^fiont — ^back ^ V -^front , then the 
estimate of the gradient will be completely swamped by 
noise. 

While this direct strategy for estimating the gradi- 
ent is implemented by certain eukaryotic cells (which 
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are much bigger and therefore able to measure the con- 
centration differences better), bacteria - to deal with 
the noise problem - implement a very different strategy 
of integrating the change of measured chemoattractant 
concentration over time. The trick is to realize that 
dc{x{t))/dt = dc/dt + Vc • v; therefore, in a tempo- 
rally constant gradient, the time derivative of the con- 
centration measured by the bacterium at its position, 
x{t), swimming with velocity v{t) = is related to 

the spatial gradient of the concentration. It is also clear 
that with this strategy, bacteria can be experimentally 
"tricked" into believing that they move in a spatial gra- 
dient of the concentration, where really the concentration 
is spatially uniform but variable in time; this has been 
the basis of many beautiful chemotactic experiments by 



H. Berg and coworkers (Berg et al 1972 1. 



In summary, in biology many important processes de- 
pend on events that occur between very small numbers 
of molecules. This can either be the detection of a single 
photon by the photopigment in the retina, or the detec- 
tion of chemoattractant molecules by the swimming bac- 
terium, or perhaps the binding of the transcription factor 
to its binding site. After all, in the last example one can 
think of the binding site trying to detect or "measure" 
the TF concentration, making this example analogous to 
detecting chemoattractants or photons; it doesn't really 
matter that in one case the ligands are internal to the 
cell and in the other the ligands are external. On gen- 
eral grounds, we expect that in all instances where small 
numbers of molecules are involved in control processes, 
noise might be an issue. 



C. Impact of noise in biological systems 

When we study problems in mechanics, we begin by 
clearly delineating what is our system of study and what 
is the environment. The same distinction is useful when 
we talk about noise. For example, our system of study 
might be a single genetic regulatory element, denoted by 
c — ^ and its environment is the cellular environment 
of the particular single cell in which this system is em- 
bedded. This distinction is central because in an experi- 
ment, we will often observe many instances of the system, 
and in each instance measure the input c and output g 
- this is how many noise experiments using single-cell 
microscopy or FAGS (fluorescence activated cell sorting) 
are performed. 

Once we measure the response g given the input c and 
the fluctuation in the response crg(c), we need to ask what 
sources of variability contribute to that measured fluctu- 
ation. One source is the intrinsic noise in the regula- 
tory element c ^ g itself, contributed by the inherently 
stochastic molecular processes that we discussed above 
and will soon return to. Another source of noise, how- 
ever, called extrinsic noise, arises because the cellular en- 
vironment changes from cell to cell; that is, one cannot 
guarantee that the system of study is always exposed to 



the same conditions. For example, the RNA polymerase 
concentration might fluctuate from cell to cell, and even 
if the process c g would in itself contribute no intrin- 
sic noise, the fact that this is transcriptional regulation 
process that involves RNAP and RNAP varies in the en- 
vironment (from cell to cell), will induce some variance 
in measured g across the population of cells. There is 
nothing fundamentally different about the intrinsic and 
extrinsic noise; the distinction is useful solely when a sys- 
tem of study can be enclosed into a conceptual box and 
separated from the environment, and the noise in the sys- 
tem can be separated from the nois e in its environme nt 
by clever experimental techniques^ ( Swain et al 2002 1 . 



We introduced the abstraction of input / output de- 
vices in the previous lecture. Correspondingly, it makes 
sense to divide the intrinsic noise into the input and out- 
put contributions. In a regulatory process c — > where 
TF c controls gene g, the total noise in g, cr^(c), can arise 
from two kinds of sources. The flrst source, the output 
noise, deals with the generation of output. In our case 
this is the transcription of mRNA molecules and their 
translation into protein molecules g. This source of noise 
would be present even if there were no transcriptional 
regulation whatsoever, i.e. if the gene were constitutively 
expressed. By making more and more mRNA and pro- 
tein molecules, the relative impact of the output noise 
can be reduced. 

The second source of noise is called the input noise, and 
this arises because the concentration c at the binding site 
location itself is fluctuating (we will discuss why soon). 
This source of noise is important for two reasons: (i) it 
gets mapped through the nonlinear input/output relation 
g{c) to give rise to the total measured noise in g; (ii) the 
input noise cannot simply be reduced by clever design at 
the output end. This is familiar to anyone who has dealt 
with electronics - the noise in the input to the amplifier 
is the noise that no amount of gain can reduce away. 

When studying biological systems, we are faced with 
additional sources of fluctuation that are sometimes also 
referred to as noise, but contribute in addition to the 
fundamental sources of variability (such as thermal or 
shot noise sources discussed above). The fundamental 
sources of noise, or physical limits to precision in biology, 
are the lower bound on the noise that biology cannot 
avoid. But clearly biological processes can be more noisy 
than the lower bound set by physics^". 



^ This is similar to internal and external forces acting on the sys- 
tem in mechanics. There is nothing physically different between 
external and internal forces. However, we need to account for 
them properly and not mix them up when writing down the dy- 
namical equations for the system of interest. 

An interesting choice of systems for study are therefore those sys- 
tems which are believed to be under strong evolutionary pressure 
to reduce noise as far as possible, perhaps down to the physical 
limits to precision. It is thought that dark vision is one of such 
systems, because it gives selective advantage to both predators 
or pray that can see after sunset, or even in starlight. Perceptual 
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One such additional source is experimental noise that 
corrupts our measurements, independently of the actual 
noise in the biological system. Often, we don't have any 
theoretical understanding of what form this noise has, 
unlike in physics where the responses of the measurement 
devices can precisely be characterized. For example, mi- 
croarray assays are a very popular high-throughput way 
of measuring the activity levels of various genes. How- 
ever, there is no clear understanding of how the real level 
of mRNA in some physical units maps into the log light 
intensity ratios usually reported [in other words, what 
is -P(log light ratiojmRNA level)]. Therefore, inference 
from the measurement to the underlying quantity of in- 
terest (which you can think of as decoding the raw ex- 
perimental data into the the true mRNA levels) is often 
done using some ad hoc procedure; in Section [V] we dis- 
cuss how information-theoretic inference can circumvent 
precisely this problem of unknown experimental noise. 

Finally, how important really is noise to biology? The 
rough answer is that this depends on the system, and 
in particular to whether ultimately we can trace cellular 
(or neural) decisions to single microscopic rare events. If 
that is the case, the expectation is that the noise will 
play a large role. For example, the A switch in the phage 
controls the fate of the virus, i.e. whether it will stay lyso- 
genic or turn lytic. The bistable switch is controlled by a 
small number of transcription factor repressor molecules, 
called cro and cl, that compete for the same binding sites 
(Ptashne, 1989). In this case, the fate of the cell is tied 
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to a single molecular decision, and therefore the stochas- 
ticity is important ( McAdams et al 1997 ) . The same can 
be said for rates of "spontaneous switching," where ther- 
mal noise is able to flip a genetic switch. This is a rare, 
yet potentially important event, and considerable theo- 
retic effort goes into computing the frequencies of such 
rare events. 

Overall, in both genetic regulation and neural systems, 
the noise limits the amount of information that the net- 
work can transmit. Nevertheless, the noise can often be 
treated as a small fluctuation riding on top of the sig- 
nal. In protein-protein signaling networks (such as the 
two-component systems in bacteria), the intrinsic noise 
is thought to be small, because the reactions involve hun- 
dreds to thousands of signaling molecules. On the other 
hand, these molecules are proteins, transcribed from the 
genes and regulated by transcription factors, so the ex- 
trinsic noise can be large due to the slow (compared to 
the timescales of signaling reactions) random fluctuation 
in the total numbers of signaling proteins. There is ev- 
idence that biology tries to choose network wirings that 
make signaling networks robust with respect to this ex- 
trinsic source of slow fluctuations (Barkai et al 1997). 



studies that show human visual sensitivity approaching the limit 
set by the Poisson statistics of incoming neurons support this 
hypothesis. 
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FIG. 12 A fully stochastic simulation of a simple model of 
gene expression using reactions specified in Eq ([9|. The sim- 
ulation starts with g(t = 0) = proteins; the steady state 
is reached after about 70 minutes. On the left, the trajecto- 
ries of 20 simulation runs. On the right, the mean trajectory 
plotted in a solid line; mean ± 1-std plotted in dashed lines. 
The envelope measures the steady state level of noise due to 
(i) random promoter switching and (ii) the shot noise in pro- 
ducing the output molecules. 



D. Derivation of noise for simple gene regulation 

Let's return to the simple gene regulation scenario of 
Fig. [3) We will sketch how the noise can be derived in 
this model using the Langevin approximation, and give 
a back-of-the envelope estimate for the terms that we do 
not compute here. The reader is invited to view the full 



derivation in Ref ( Tkacik et al 2008 ) 



We start with the dynamical equations: 



dn 
'dt 
dg 
'dt 



fc+c(l — n) — k^n + ^„ 



Rn g + ^g, 

T 



(32) 
(33) 



where again we take the binding site occupancy n to be 
between and 1, and the expression level of the out- 
put gene is g; g is produced with rate R when the bind- 
ing site is occupied, and the proteins have a lifetime of 
T. We have already shown that the equilibrium solution 
of this system is n = kj-c/{kj^c -I- fc_) and g = {RT)n. 
Here we are interested in the fluctuations, crg(c), around 
the steady state, that arise purely due to intrinsic noise 
sources: (i) the fact that the binding site only has two bi- 
nary states that switch on some characteristic timescale, 
(ii) the fact that we make a finite number of discrete 
proteins at the output, and (iii) the fact that the input 
concentration c might itself fluctuate at the binding site 
location. 

One approach would be to simulate the system of 
Eqs ( 32|33 ) exactly using the Gillespie SSA algorithm. 
For a given and fixed level of input c, the results of 20 
such simulation runs are shown in Fig. 
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To compute this noise analytically instead of using the 
simulation, we have introduced random Langevin forces 
£.n, S.g- Consider the second equation, Eq (33). A sin- 
gle protein is produced anew, or is degradeoTas an ele- 
mentary step (since you don't make half a protein). In 



18 



equilibrium, the production term Rn balances the degra- 
dation term, g/r. Now consider some time T in which 
RTn — gT/r « 1, i.e. one molecule is produced or de- 
stroyed on average and with equal probability. While 
the expected change in the total number in equilibrium 
in time T is zero, the variance is not: the variance is 
equal to (production of 1 molecule)'^ + (degra- 
dation of 1 molecule)^ — 1. In general, the variance will 
be T{Rn -\- g/r) if we measure for time T. If you are 
familiar with random walks in ID, this sounds very fa- 
miliar: the mean displacement is (because leftwards 
and rightwards steps are equally likely) , but the variance 
in displacement from the origin grows with time T. 

Statistical physics tells us that in order to reproduce 
this variance in a dynamical system, we have to insert 
Langevin forces with the following prescription: 







- [Rn + g/mt^t'). 



(34) 



The mean random force is zero, it is uncorrelated in time, 
and it has an amplitude such that the random kicks have 
variance equal to the leftward and rightward step size; 
this will recover our intuition about ID random walks. 
Similarly, {in{t)in{t')) = {k+c{l - n) + k^n)6{t - t')- 

To proceed, we first linearize Eqs ( 32|33 ) around the 
equilibrium, by writing n{t) = n + Sn{t), g{t) — g + Sg{t). 
Then we introduce Fourier transforms: 
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(35) 
(36) 



Fourier transforms of ^„ and are simply ^„ = 2A:_n and 
= 2Rn, respectively (because the Fourier transform of 
a delta- function is 1, and we have also used the fact that 
in equilibrium, the two terms that contribute to each 
Langevin force are equal). 

With this in mind, the system of equations in the 
Fourier space (denoted by tildes) now reads: 
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-Sg + ig, 



(37) 
(38) 



where — (k^c + k^). 

We ultimately want to compute Cg(c). The total vari- 
ance is composed from fluctuations at each frequency oj. 



integrated over frequencies"'^^: 

did 
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{Sg{uj)Sg* (uj)) = 
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(39) 



where <S'g(w) is called the noise power spectral density of 
g, and the asterisk denotes complex conjugate. We see 
that we need to solve for Sg first from Eqs. ( 37|38 |: 
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(40) 



Next, we compute {Sg{uj)Sg* {m)) . Recalling the defini- 
tions of {£,£,*) [Eq pil], wc find that 



Sg{Uj) = 



4^- («) 



The binding and unbinding of the promoter is usually 
much faster than the protein decay time, Tc <C t. Using 
this and the fact that dx{x^ + 1)^^ = n, we finally 
find 

<ylic)^-gic)+^-^n{l-nf. (42) 



If we normalize the expression level g such that it ranges 
between (no induction) to 1 (full induction) by defining 
g = g/{RT), then the noise in g is 



Rt 



9+1 5(1-3)' 



(43) 



Our result is lacking at least one important contribution 
to the total noise. The formal derivation of this term is 



involved (Tkacik et al, 2008 Setayeshgar et al 2005), so 



we will estimate it here up to a prefactor. In our deriva- 
tion we have not taken into account that the molecules 
of transcription factor are brought to the binding site by 
diffusion. The diffusive arrival of molecules into a small 
volume around the binding site is a random process as 
well: it will induce some noise in occupancy of the bind- 
ing site, and thus in the expression level g. This is the 
contribution we are going to estimate. 

Suppose that the binding site is fully contained in a 
physical box of side a. When the average TF concentra- 
tion in the nucleus is fixed at c, the average number of 
molecules in the box is iV = a'^c. This, however, is only 
the mean number; if we were to actually sample many 
times the number of molecules in the box, we would find 
that our counts are distributed in a Poisson fashion, with 



Because the noise process is stationary (time-translation invari- 
ant), the noise covariance {5g{t)Sg{t')) = Cg{\t — t'\) will depend 
on the difference in time only, and going into the Fourier basis 
will diagonalize the covariance matrix. The total noise variance 
is the integral over these independent Fourier components, and 
that is equal by Parseval's theorem to the total noise variance 
obtained by doing the corresponding integral in the time domain. 
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a variance equal to the mean: a% = N. This is just the 
famihar shot noise in a new, molecular disguise! 

How can one reduce the fluctuations ct^? As always, 
one can make more independent measurements, and aver- 
age the noise away. With M independent measurements, 
the effective noise should decrease, cr% = crff/M. Sup- 
pose the binding site measures for a time r (the protein 
lifetime, the longest time in the system). How many in- 
dependent measurements were made in the best possible 
case? It takes to = a^/D time for the molecules to dif- 
fuse out of the box of size a and be replaced with new 
molecules; if we take snapshots and count the molecules 
at intervals faster than to, we are not making independent 
measurements. Therefore M = t/Iq = tD/o^. Plug- 
ging this into the expression for effective noise, we find 
cr^^g = a^c X /{Dt). Since N — a'^c, it follows that 
cr?, = a^cr^, and finally: 
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(44) 



Equation (44) is a fundamental result: any detector of 



linear size a measuring concentration c, to which ligands 
are transported by diffusion with coefficient D, and mak- 
ing measurements for time r, will suffer from the error 
in measurement in concentration, given by cTc- This con- 
tribution to the noise is called dijfusive noise, and it is a 
special form of input noise. 

To assess how this error maps into the error in the 
gene expression g, note that any error at the input can 
be propagated to the output through the input/output 



relation, g{c) [see Fig 13 



(45) 



Adding the diffusive noise to previously computed terms 
in Eq pSl), we find: 



1 



1 



9il-^f + ^y^. (46) 
UacT 



Let us stop here with the derivation, interpret the 
terms and summarize what we have learned so far. We 
tried to compute various contributions to the noise in 
the expression of gene g, in a simple regulatory element 
where the TF c regulates g. In any real organism, such a 
small regulatory element will be embedded into the reg- 
ulatory network, and c will experience fiuctuations on its 
own that will be transmitted into fluctuations in g, the 
so-called transmitted noise, in addition to intrinsic noise 



calculated here ( Pedraza et al 2005 ) 



On top of intrinsic and transmitted noise sources, the 
output will also fluctuate due to the extrinsic noise be- 
cause the cellular environment of the regulatory network 
is not stable. But even without these complications, we 
can identify at least three contributions intrinsic to the 
c — g regulatory process: 

Output noise. This is the flrst term in Eq (46 1, where 




FIG. 13 Propagating the noise in the input ctc, through the 
mean input/output relation, g(c), into the effective noise in 
the output, cjg. The variances are related by the square of the 
local slope of the input/output curve, dg/dc. 



noise that arises because we produce a flnite number of 
discrete output molecules. In the simple setting discussed 
here, the proportionality factor really is 1 [when g is mea- 
sured in counts, as in Eq (42 1], and this is a true Poisson 



noise where variance is equal to the mean. If we treated 
the system more realistically, with separate transcription 
and translation steps, the proportionality constant could 
be different from 1; a more careful derivation shows that 
then, (t| = (1 + 6)/(i?r)g, where h is the burst size, or 
the number of proteins produced per single niRNA tran- 
script, on average (Tkacik et al 2008). This is easy to 
understand: the "rare" event is the transcription of a 
mRNA molecule, and that has true Poisson noise statis- 
tics, but for each single mRNA the system produces b 
proteins, and the variance is thus multiplied by b. 

Input promoter switching noise. This is the sec- 
ond term in Eq ( 46 ) . The source of this noise is binomial 



the variance cr| oc g. Funamentally, this is a form of shot 



switching of the promoter, as it can only be in an in- 
duced (n = 1) or empty (n — 0) states. If we interpret 
n as the probability of being occupied, then the variance 
must be binomial n{l — n). Fluctuations between empty 
and full states of occupancy happen with the timescale Tc 
[see Eq (37)], and the system averages for time r, so t/tc 
independent measurements are made, reducing the bino- 
mial variance to n{l — n)Tc/T. Since r^fc- — {1 — n) and 
n = g, we recover the switching term, g(l — g)"^ / (Jz-t). 

This term depends on the microscopic way the pro- 
moter is put together, hence the dependence on the ki- 
netic parameter Regardless of these details, how- 
ever, every promoter that has an "on" and "off" state 
will experience fluctuations similar in form to these de- 
rived here. In our example, fc_ is the rate of TF unbind- 
ing from the binding site and this is usually assumed to 
be very fast compared to the protein lifetime (in other 
words, the occupancy of the promoter is equilibrated on 
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the timescale of protein production). In other scenar- 
ios that effectively induce gene switching, however, this 
assumption of fast equihbration might not be true. In 
particular, attention has lately been paid to DNA pack- 
ing and regulation via making the genes (in) accessible to 
transcription using chromatin modification. The packing 
/ unpacking mechanisms are thought to occur with slow 
rates, and such switching term might be an important 
contribution to the total noise in gene expression (Raj et 
ai||2006|. 



Input diffusion noise. The last term in Eq (461, 



as discussed, captures the intuition that even with the 
fixed average concentration c in the nucleus (that is, 
even if c did not undergo any fluctuation relating to 
its own production, degradation and regulation), there 
would still be local fluctuations at its TF binding site lo- 
cation, causing noise in g. This contribution is important 
when c is present at low concentrations. As an exercise, 
one can consider the approximate relevance of this term 
in case of prokaryotic transcriptional regulation, where 
D ~ 1 fXTc? / s, the size of the binding site a ~ 3 nm, 
the relevant TF concentrations are in nanomolar range, 
and the integration times in minutes. It has been shown 
that this kind of noise also represents a physical limit 
in the sense that it is independent of the molecular ma- 
chinery at the promoter, as long as the predominant TF 
transport mechanism is free diffusion. 



What we presented here theoretically was a simple ex- 
ample, but even so we'll see in the next section that 
the correspondence with the experiment is unexpectedly 
good. What is important for the lecture, however, can 
be summarized in the following observations: (i) Not 
only can we make models for mean input/output rela- 
tions, but we can compute the noise itself, as a function 
of the input. Noise behavior is connected to the kinetic 
rates of molecular events, which are inaccessible in any 
equilibrium measurement of mean input/output behav- 
ior. Therefore, if noise is experimentally accessible, it 
provides a powerful complementary source of informa- 
tion about transcriptional regulation, (ii) There are fun- 
damental (physical) sources of noise which biology can- 
not avoid by any "clever" choice of regulatory appara- 
tus; thus the precision of every regulatory process must 
be limited. These sources all fundamentally trace back 
to the finite, discrete and stochastic nature of molecu- 
lar events. In theory, the corresponding noise terms thus 
have simple, universal forms, and we can hope to measure 
them in the experiment, (iii) There are sources of noise 
in addition to the fundamental, intrinsic ones, including 
extrinsic, experimental, etc. The hallmark of a good ex- 
periment is the ability to separate these sources by clever 
experimental design and/or analysis. We'll show how ex- 
trinsic and intrinsic noise sources can be separated in the 
examples below; in Section [V] we show how to deal with 
unknown experimental noise. 



E. Experiments 

With the advent of quantitative microscopy, the use 
of protein-GFP fusions and FACS measurements, noise, 
precision and reproducibility in gene regulation have be- 
come central themes in biophysics and molecular biology. 
A milestone has certainly been the two-color experiment 
of Elowitz and coworkers ( Elowitz et al 2002 1 , allowing 



the separation of intrinsic and extrinsic noise sources. 

The basic idea of the two-color experiment is sim- 
ple. To study a genetic regulatory mechanism c — )■ g, 
one engineers a bacterium to have two (almost) identi- 
cal genes differing only in the color protein fusions; say 
gi = g—CFP and g2 = g—YFP. Both of these genes are 
regulated by the same transcription factor, c, and have 
the same promoters. The bacteria grow under the micro- 
scope, and for each bacterium, it is possible to collect the 
joint measurements of (31,52) at a given fixed induction 
level (related to c or the concentration of its inducer). 

The basic realization of the experiment is as follows: 
scatter-plotting the values of (gi, (72) collected from a cell, 
one can split the total variance in this cloud of points into 
the variance along the "correlated" axis (along the equal- 
ity line), and the "perpendicular" axis, as in Fig. |14[ 
These two orthogonal contributions are then identified 
with the extrinsic and intrinsic noise strengths, respec- 
ticely. The fiuctuations in gi and 32 can be correlated 
only because they don't happen in the system c —?> g it- 
self, but in its intracellular environment, which affects 
both copies of gi and 52 equally, in a correlated fashion. 
The intrinsic noise, however, is due to the terms we pre- 



viously discussed in relation to Eq (46), and would be 



present even if the cellular environments were perfectly 
stable and reproducible. The intrinsic noise is uncorre- 
lated, because random events in the c — ^ g regulatory 
system happen independently in each of the two-color 
replicas in the cell. This "two-color" trick has enabled a 
whole set of experiments in which contributions from var- 
ious steps in more complicated regulatory schemes, e.g 
cascades c — » gi — » ff2, have been teased apart (Pedraza 



et al 2005 



Hooshangi et al 20051. Again, note that "in- 



trinsic" and "extrinsic" are a matter of where one chooses 
to draw the boundary of the system and which part of the 
regulatory element gets replicated to apply the two-color 
paradigm. 

A number of studies have since focused on the noise in 
prokaryotic gene expression. The findings indicate that 
the dominant sources of noise are the output (intrinsic) 
noise of making mRNA transcripts with relatively short 
correlation time (in minutes), and the extrinsic noise with 
long correlation time (of the order of a cell cycle) ( Rosen- 



feld et al, 2005). The noise in units of the mean, Cg/g, 



is very roughly of the order of ~ 20%. A similar set of 
results was obtained in a high-throughput essay in yeast, 
where the noise was found to scale with the mean with a 



large prefactor, consistent with bursty expression (Bar- 



Even et al 2006). An earlier study in the GAL promoter 



in yeast has, however, found a significant contribution of 
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FIG. 14 The schematic diagram of the two color experiment, 
a) The system of interest (in a square box) is a gene g reg- 
ulated by the transcription factor c. Two identical copies of 
the gene and its promoter are inserted; one gene is fused to 
yellow, while the other to the cyan fluorescent protein. The 
system is embedded into the cellular environment, and both 
copies of the gene share the same pool of transcription fac- 
tors, RNA polymerase molecules etc. At a fixed level of c, the 
joint readouts in the yellow and cyan channels are taken and 
scatter-plotted against each other, as in b). Some of the vari- 
ance is correlated, because the two copies in our system share 
the same cellular environment and are coherently impacted 
by the fluctuations in c, RNAP etc. This is the extrinsic 
noise component. Orthogonal to that is the uncorrelated (or 
intrinsic) noise component, due to the fluctuations that hap- 
pen separately and independently in each of the two identical 
gene copies in our system. 



a noise that looks like the input switching noise, and has 
explained it by the cycling of the promotor through its 
microscopic states ( Blake et al 2003 ) . 



Later work in higher organisms has revealed an ever 
more important role of the intrinsic input noise, both 
the switching and the diffusive components. In mam- 
malian cells, it has been possible to observe the number 
of mRNA molecules using the FISH method, and it was 
found that whole genes (or even sets of coUocalized genes 
on the DNA) stochastically switch on and off; this might 
be a consequence of chromatin remodeling mechanisms 



( |Raj et al||2006 |. 

Our work in gene regulation during early Drosophila 
morphogenesis has tried to quantitatively connect the 
model of Eq (46) with the experimental data (Gregor 



et al[ |2007| |Tkacik et al[ |2008[ ). During development, 
each nucleus in the embryo of the fruit fly experiences a 
spatially varying concentration of the input transcription 
factor, bicoid, c, and activates the expression of hunch- 
back, g; both of these quantities can simultaneously be 
measured using immunostaining and microscopy meth- 
ods. Many nuclei experience the same level of input, c, 
in the same embryo. We can thus ask, for each input 
c, about the mean input/output relation, g{c), and also 
for the noise crg(c), and these two measurements can be 
combined into cFg{g), the noise in hunchback expression 
as a function of mean induction, shown in Fig. |15[ 

All three fits of the theoretical models to the data 
shown in Fig. [15] are two parameter fits. We fit essen- 
tially the same model as that of Eq (461; the only ex- 



0.15 




0.05 



FIG. 15 The noise in the expression level of hunchback, ag, 
plotted as a function of mean hunchback induction g, nor- 
malized to span the range from to 1. Experimental data 
is shown as circles, with error bars denoting the std across 
9 embryos. The solid black line is a fit of the model with 
output and input diffusive noise terms, assuming known Hill 
coefficient of 5 for the hunchback regulation by bicoid. In red 
dashed line, the same fit using the Hill coefficient of infinity 
(step like response). In blue, the fit using the input switching 
noise and the output noise contributions. The black two- 
parameter fit describes the data very well. 



ception is that the noise is derived for a cooperative pro- 
moter with cooperativity of 5, which can be read off from 
the Hb/Bcd input /output relation, see Fig. [8| One pa- 
rameter is always the magnitude of the output noise [the 
prefactor to the output noise term of Eq([46|)] . The sec- 
ond parameter is the magnitude of the input diffusive 
noise (for solid black and red lines), or the magnitude of 
switching noise (blue line). The solid black fit describes 
the data excellently, and the following conclusions can 
be drawn from this study: (i) Qualitatively, input noise 
sources produce a peak in the noise at intermediate ex- 
pressions when one plots (Tg vs g. The input noise goes to 
for zero or full induction. In contrast, the output noise 
increases monotonically with the induction. Therefore, 
whenever the experiment claims a peak in the noise, this 
is likely due to the non-negligible contribution of input 
noise to the total, (ii) The previous intuition also enables 
us to interpret the two fitted parameters. The strength 
of the output noise is proportional to the noise magni- 
tude at full induction, g — 1, while the fitted strength of 
the input noise is proportional to the size of the peak at 
intermediate induction, (iii) We find that the fitted val- 
ues of parameters are consistent with plausible values for 
transcriptional regulation: i.e., the magnitude of output 
noise is related to the number of molecules of hunchback 
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per nucleus (estimated from the fit to be ^ 700 — 4000) , 
and the magnitude of the input noise is consistent with 
the measured bicoid diffusion constant and integration 
time. 



F. Relation to neuroscience 

Processes that underhe the hmits to temporal precision 
of single neurons are physically analogous to the noise 
sources discussed in gene regulation; indeed, the study 
of temporal jitter in neural spiking predates the study of 
noise in gene regulation by several decades. 

A neuron generates the ionic currents by opening 
and closing of ion channels. These proteins have 
voltage-dependent probabilities, p{V), of being open or 
closed. This nonlinear dependence, combined with the 
cable equation for propagating voltage disturbances in 
a medium with some capacitance, gives rise to self- 
excitability and generation of action potentials. 

By making an analogy to the binomial "switching 
noise" in gene regulation, we see that for each ion chan- 
nel, there will be an associated binomial variance in its 
activity, proportional to p{V){l — p{V)y^. This vari- 
ance is reduced because there are many channels in each 
neuron that open and close independently (here we have 
population noise averaging, similar to temporal averag- 
ing in gene expression), but it is not reduced to zero. Us- 
ing patch-clamp techniques it has been possible to elec- 
trically isolate a small patch of the membrane contain- 
ing a very small number of channels, and to observe the 
quant al pulses of current flowing through single channels. 
This randomness in opening and closing of ion channels is 
one of the reasons why the neurons do not respond with 
identical spike trains to presentations of exactly repeated 
inputs. 



IV. INTRODUCTION TO INFORMATION THEORY 

Up to this point we have stressed the role of noise in bi- 
ological networks and mentioned several time that noise 
limits the ability of the network to transmit information; 
in this lecture we will turn this intuition into a mathe- 
matical statement. 

Recall that in our introduction to noise, we started 
with a probabilistic description of an information trans- 
mission system: given some input c, the system maps will 
map it into the output g using a probabilistic mapping, 
P{g\c). In case there were no noise, there would be no 
ambiguity, and g = g{c) would be a one-to-one function. 

Suppose that the inputs are drawn from some distri- 
bution P{c) and fed into the system which responds with 



the appropriate g. Then, pairs of input/output symbols 
are distributed jointly according to 

P{c,g) = P{g\c)P{c) (47) 

In what follows, we will be concerned with finding ways 
to measure how strongly the inputs (c) and the outputs 
{g) are dependent on each other. It will turn out that 
the general measure of interdependency will be tightly 
related to the concept of information. 

A. Entropy and mutual information 

Let's suppose that our information transmission "black 
box" would be a hoax, and instead of encoding c into g in 
some fashion, the system would simply return a random 
value for g no matter the input c. Then c and g would be 
statistically independent, and P{c,g) — P{c)P{g)] such 
a box could not be used to transmit any information. 
As long as this is not true, however, there will be some 
statistical relation between c and g, and we want to find 
a measure that would quantify "how much" can I know, 
in principle, about the value of c by receiving outputs 
g, given that there is some input/output relation P{g\c) 
and some distribution of input symbols P{c). 

The first quantity that comes to mind as the interde- 
pendency measure between c and g is just the covariance: 

Cov(c,.g) = JdcJ dg{c-c){g~-g)P{c,g)- (48) 

it is not hard, however, to construct cases in which the 
covariance is 0, yet c and g are statistically dependent. 
Covariance alone (or correlation coefRcient) only tells us 
about whether c and g are linearly related, but there 
are many possible nonlinear relationships that covariance 
does not detect; for example, see Fig [TBI 

Moreover, we would like our dependency measure to be 
very general (free of assumptions about the form of the 
probability distribution that generated the data) and de- 
finable for both continuous, as well as discrete outputs^"^. 

We will claim, following Shannon, that there is a 
unique, assumption-free measure of interdependency, 
called the mutual information between c and g. Before 
we define it, however, we need to define another quantity, 
called the entropy of a distribution P(c): 

S[P{c)] J dc P{c) log2 P{c) (49) 

To keep things simple, let's for now assume that the 
value of interest is discrete, that is, that c can only take 
on the values Ci, i — I, . . . , K . In that case 

K 

S[P{c)]^-J2P{c^)l0g^P{c^). (50) 
i=l 



In principle, there could also be an associated "shot noise" vari- 
ance in the number of ions that flow through these channels. 



Covariance can be problematic when used on discrete quantities. 
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FIG. 16 Examples of two variables, drawn from three joint 
distributions. Shown are the scatterplots of example draws. 
On the left, the variables are linearly correlated, and the cor- 
relation is close to 1. In the middle, the variables are interde- 
pendent, but not in a linear sense. The correlation coefficient 
is 0, but measures of statistical dependence, such as mutual 
information, give non-zero value. Note that we are looking 
for a general measure of interdcpendency: if we had a model 
that assumes that x and y lie on a circle, we could fit that 
particular model or use a measure that makes the circular as- 
sumption. Instead, we would like to find a measure that de- 
tects the dependency without making any assumptions about 
the distribution from which the data has been drawn. On the 
right, the variables are statistically independent, and both 
linear correlation and mutual information give zero signal. 



If the data were continuous, we would have discretized 

it prior to computing the entropy (we discuss the dis- 
cretization later); the measure that we are after, the mu- 
tual information, will turn out to be independent of dis- 
cretization. 

Entropy can be defined for any distribution. It is al- 
ways positive, measured in bits, and always takes a value 
between two limits: < S[P{c)] < logg K (in discrete 
case). The entropy is zero when the distribution has its 
whole weight of 1 concentrated at a single Cj. The en- 
tropy is maximal when P(c,) = 1/K, i.e. P is a uniform 
distribution. The entropy is a unique measure of uncer- 
tainty about the value of c: the uncertainty is when 
the distribution is peaked at a single value, and maximal 
when all c, are equally likely. 

In one of the fundamental works of the 20th century 
science. Shannon has argued that this quantity, the en- 
tropy, is connected to the amount of information that 
needs to be supplied to specify a particular value of c. 
Suppose c can take on 8 different values Cj = 1, ... 8, and 
P(cj) = 1/8, that is, the distribution is uniform, and by 
our definition, has an entropy of S' = log2 8 = 3 bits. 
I draw a particular from the distribution and don't 
share the value with you. You are allowed to ask a series 
of yes/no questions about the value I chose and your task 
is to determine my choice in as few questions as possible. 
What is the minimum number of questions that you need 
to ask, on average? This turns out to be the same as the 
entropy, and in case when P{ci) is uniform, one of the 
optimal strategics is bisection: "Is the chosen Ci larger 
than 4?" If yes, the next question could be "Is Cj larger 
or equal to 7?" Otherwise, you could ask "Is the chosen 
value less or equal to 2?" and so on, until the correct 
is identified. Bisection is optimal because with a single 



binary question, it partitions the set of possible values 
into two equally likely subsets. To show that the number 
of questions is equal to the entropy even in the case of 
non-uniform P(cj) one needs some extra work, but the 
intuition remains the same. 

Note that the entropy is the minimum number of ques- 
tions, on average - across many repeats of the game. In 
single instances of the game you might be lucky and hit 
the correct number by simply asking "Is the correct value 
Ci equal to 7?". However, this strategy will only termi- 
nate in 1/8 of the games with a single question, and on 
average, it is worse than the strategy of bisection. 

In case of continuous variables, no number of ques- 
tions can pin down the exact value of a real number, 
because, mathematically, real numbers have infinite pre- 
cision. In practice this is not a concern, because the 
precision of physical data is always limited by noise or 
measurement error. We can therefore always discretize 
a continuous measurement into bins with the size of the 
error bar. Formally, all information theoretic quantities 
can be properly defined also for continuous variables, but 
in the interest of clarity we will skip these generalizations 
here. 

To summarize: a "bit" is thus amount information con- 
tained in a single binary response to an optimally posed 
question in other words, a bit is the maximum amount 
of information that a binary question can convey. For 
each distribution P(c) of some quantity c we can define a 
measure called its entropy, a positive number expressed 
in bits, that quantifies the uncertainty about the value 
of c, and is connected to the minimal number of yes/no 
questions that need to be asked to find out the value of 
c on average. 

Let's now return to the original problem, where we 
think about two variables, c and g, jointly distributed as 
P{c,g), and we would like to quantify how statistically 
dependent the two variables are. Let's start by comput- 
ing S[P{g)], and assume that our values are discretized 
(so we are dealing with gj and Cj, where i and j run over 
all possible discrete choices for g and c, respectively): 

S[P{gj)] = ~Y.P(9j)log,P{gj). (51) 
j 

According to what we have just learned, this is the un- 
certainty about the value of g. We can also compute the 
following: 

Sn{ci) = S[P{gj\ci)] = -^P{gj\ci) logs P{gj\ci)- 

(52) 

This entropy still depends on c, because the sum is taken 
only over the possible values for gj . The interpretation of 
this quantity is the uncertainty about the value of g if we 
know the value of c to be Ci. If c and g are related in any 
statistical way whatsoever, we expect that by knowing 
c, our uncertainty about g will be less, on average, than 
if we don't know c. In other words, c will give us some 
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information about g, if there is any statistical interde- 
pendency. In equations, let's define mutual information 
/ as: 



(53) 



The crucial insight is that without any knowledge of c, 
g will have the uncertainty S[P{g)]; with the knowl- 
edge of c, the average uncertainty about g is smaller, 
P{ci)Sn{ci). The difference between these two quan- 
tities is the mutual information, and Shannon has shown 
that this is a unique assumption-free measure of any sta- 
tistical dependency that does not make any assumption 
about the form of the joint distribution P(c, g). 

Mutual information is a number in bits, and can be 
shown to be always positive. Algebraic manipulation of 
Eq ( 53 1 quickly produces a more compact expression: 



Hc;g) = ^P(ci,5j)log2 



(54) 



where P{ci) and P{gj) are marginal distributions of the 
joint P{ci,gj). Note that the mutual information is a 
single number, not a function, although it is convention- 
ally written with the arguments in parenthesis, I{c;g), 
to denote that it is information between c and g. 

Mutual information has a number of very appealing 
properties: 

• It can be defined for continuous or discrete 
quantities. Mutual information is a functional of 
a probability distribution, and probability distribu- 
tions are very generic objects, c and g could both 
be continuous, or any one or both can be discrete. 

• It is reparametrization invariant. Mutual in- 
formation betwen c and g is the same than mu- 
tual information between any one-to-one function 
of c, /(c), and any one-to-one function of g, h{g), 
that is I{c;g) — I{f{c);h{g)). In biological exper- 
iments, this is a great asset, as we will soon see: 
experiments often report, e.g. intensities or log- 
intensities on the microarray chips or in FACS sort- 
ing, and there is a lot of discussion about how this 
data should be normalized and transformed prior 
to any analysis. This is important because some 
statistical measures of correlation, like correlation 
coefficients, depend on it. Mutual information, in 
contrast, is invariant to such reparametrizations of 
the variables. 

• It is symmetric. Mutual information tells us, 
in bits, how much I learn about 5 if I know the 
value of c. As is evident from Eq (54), I{c;g) is 
symmetric with respect to the change in c and g, 
i.e. I{c;g) = I{g; c). This means that I can equally 
well learn about c by knowing the value of g. 

• It obeys data processing inequality. Suppose 
that g depends on c and k depends on g (but not 



4 

3 

ra 2 

1 





4r 

3 . 

m 2 



0^- 



3 

m 2 
1 



FIG. 17 Interpreting the mutual information values. On the 
left, the mutual information between c and <? is 1 bit. This is 
because both c and g have two "states" (each corrupted by 
some noise), and the states are in a one-to-one relationship: 
knowing which state c is in tells me about what state g is in. In 
the middle, the connection between c and g is no longer trivial. 
When c has low value, g can either be low or high. The only 
information we have is that it is impossible for c to be high 
when g is low. This yields ~ 0.25 bits of information between 
c and g. On the right, there are three possible states for c 
that are distinguishable given the noise, and each maps into 
a unique state for g. The mutual information is 7 ~ log2(3) 
bits. 



directly on c), in some probabilistic fashion. In 
other words, one can imagine that there is a Markov 
process, c g k, where arrows denote a noisy 
mapping from one value to the next one: c gives 
rise to g and g to k. Then /(c; k) < I{c; g), that is, 
information necessarily either gets lost or stays the 
same at each noisy step in the transmission process, 
but it is never "spontaneously" created. 

There is a number of powerful theorems relating to 
mutual information which we will not go into here, but 
the interested reader is referred to the classical text of 
Thomas and Cover for details. 

To summarize. Shannon has shown that there is a 
unique, positive, assumption-free measure of statistical 
interdependency of two variables c and g that is called 
the mutual information and which is given by Eq |54| 
This measure is zero if and only if the two variables are 
statistically independent. It does not describe what kind 
of interdependency there is between the two variables, 
rather, it only quantifies how much of interdependency 
there is, in bits. 

Since this is a measure over probability distributions, 
it is extremely generic. Before focusing on one in-depth 
example in Section |Vj let's first enumerate a number of 
possible applications relevant to biology: 

First, c and g could be expression levels of different 
genes. The levels can be statistically correlated because 
the genes are regulated by a common transcription factor, 
because they regulate each other, or because their expres- 
sion is modified coherently by some other cellular prop- 
erty (such as volume change) . Mutual information can be 
used to measure the dependency between the two expres- 
sion levels. In particular, in microarray experiments one 
measures the simultaneous expression of many (possibly 
all) genes, across a range of conditions, such as changes 
in nutrient concentrations, pH, temperature etc. Then, 
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we can construct a similarity matrix oi K xK mutual in- 
formation values between all K genes, computed across 
all experimental conditions. This matrix measures the 
degree to which the genes are coexpressed. Commonly, 
correlation coefficients of log intensity values are used for 
this purpose, but it has been shown that mutual infor- 
mation can discover nonlinear dependencies, such as that 
in Fig. |17[ which the correlation measure will miss. 

Second, in these lecture notes we have set up an ex- 
ample where the transcription factor c regulates the ex- 
pression of gene g. Computing the mutual information 
between c and g then can answer, in a mathematically 
precise way, an interesting question: "Are genetic regu- 
latory elements binary switches (i.e. do they have the 
capacity of 1 bit) or are they able to convey more than 
1 bit of information, and if so, how much?" We explore 
this in Section IVIII 

Third, one could ask about how strongly any DNA se- 
quence, for instance that of a TF binding site, influences 
the expression level of a reporter gene. This is an in- 
teresting problem that is looking for a relation between 
a short segment of DNA sequence of length L, s, and 
the expression level g. Using the usual measures of de- 
pendency we might face practical difficulties because the 
sequences live in a space of s G {A, C, T, G}^, whereas 
the expression levels are real-valued measured quantities, 
g € 3?. Since mutual information is defined on probability 
distributions and P(s, g) is a well-defined object, finding 
I{s;g) is possible. We explore this in Section [V] 

Fourth, we could ask about the information between 
the presence or absence of a given gene, or set of genes, 
and the phenotypc of the organism (Slonim et all|2006|. 

Fifth, in visual neuroscience, neurons in the retina are 
sensitive to spatio-temporal correlations of light; when 
exposed to the preferred stimulus, the neuron might 
spike. How much information does a spike carry about 
the neural input? Again, answering this question re- 
quires us to find statistical dependency between the stim- 
ulus and a point object in time, such as a spike, and 
/(spike; stimulus) is a natural way of quantifying this de- 
pendency. 

Sixth, one could use information as a measure over 
discrete objects, such as sequences. For instance, how 
much information does a single base-pair carry about 
whether the region of the genome in which the base pair 
is located is a coding vs noncoding region? How much do 
pairs, or consecutive triplets of base-pairs carry about the 
coding vs non-coding region? Another possible example 
would be to compute the mutual information between 
the sequence snippets of related organisms, /(si,S2), as 
a measure of evolutionary distance. 
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B. Estimation techniques: an example 

To be concrete, let's focus on a real da taset 
was reported in Ref. (Sachs et al 2005), 
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FIG. 18 Raw data of Ref ( [Sachs et ail|2005[ ). The activation 
level (grayscale) of 11 signaling proteins (vertical a^cis) was 
recorded in ~ 7000 single cells. Samples are collected into 
consecutive blocks, for each of the 9 conditions (horizontal 
axis) . 



human immune system cells. To this end, they designed 
probes against specific phoshorylated (or otherwise acti- 
vated) forms of 11 signaling proteins in the network; each 
probe was tagged with a fiuorophore of a different color. 
Many cells were fixed and stained, and ran through the 
FACS machine to obtain simultaneous readouts of the 
activation levels of 11 proteins, for each cell. The mea- 
surements were done in 9 conditions Ck, k — 1,...,9, 
and in each condition a sample of around ^ 600 cells was 
recorded. Conditions differed in the chemical environ- 
ment that the cells were exposed to: in some conditions, 
naturally occurring ligands were presented, while in the 
others, artificial blockers or activators, specific for some 
of the signaling proteins in the network, were applied. 

For our purposes, the experiment provides us with a 
dataset of ~ 7000 samples, where each sample is a simul- 
taneous recording of 11 activation levels, gi, i — 1, . . . , 11; 
see Fig. [T8| 

Let us first quantify how correlated are pairs of ele- 
ments in this signaling network, across all conditions pre- 
sented; this is similar to the procedure that would be used 



The data 
where the au- 



in microarray experiments. Figure 19 shows the pairwise 
correlation coefficients using two different normalization 
schemes. One immediate problem that we face is that the 
correlation values strongly depend on the normalization 
of our data. More problems, however, are revealed when 
we look at the histograms and scatter-plots of the activi- 
ties ~ Fig. 20 shows that the histograms P{gi) have very 



thors studied a MAP kinase cascade signaling network in nontrivial, multi-peaked structure, and the scatterplots 
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FIG. 19 The 11 x 11 matrix of correlation coefficients be- 
tween raw values reported in the experiment (left) and the 
log-transformed values (right). Note that the two matrices 
are significantly different. We also see that the activity of 
each network element is (significantly, not shown) correlated 
with most of the other elements. 
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FIG. 20 The non-Gaussian nature of the data. On the left, 
the histogram of the log-activity level of signaling protein g2 
across all 9 conditions. We see 3 distinct "activation states" 
as peaks in the distribution. On the right, a scatter plot, 
across all conditions, of the simultaneously recorded levels of 
g2 and g^, showing a very nonlinear dependence. 



of pairs of activity levels reveal very non-linear depen- 
dence. 

Our first task will be to measure the pairwise depen- 
dencies using the mutual information measure of Eq ( 54 1 
To compute the mutual information, the equation in- 



structs us discretize both c and g into b bins, histogram 
P{c,g) from the data {P{c,g) will be a & x 6 matrix), 
and from there compute /(c; g) in a straightforward way 
- we will refer to such direct calculation with binned and 
sampled data the naive estimator with TV samples and 
b bins, and denote it by Ip{,i,{c;g). Two problems, how- 
ever, are in our way: (i) We need to discretize (bin) the 
data and there is a question of where to draw the dis- 
cretization boundaries to gain the best statistical power; 
(ii) It can be shown that mutual information is a sensi- 
tive statistic to compute; in particular, it gives a biased 
estimate of the true value when evaluated on probability 
distributions sampled from any finite dataset. This hap- 
pens because the empirical probability, P{c,g), that is 
obtained by sampling, has bins with small (or even zero) 
counts. 

One of the most straightforward ways to correct for 
this bias is the so-called direct estimation method, which 
addresses both issues (i) and (ii) simultaneously. The 



crucial realization was to derive the small-sample effect 
on the naive estimator: 



lN,b{c;g) = /co.b(c;.g) + ^M^) + 



(55) 



This equation says that if we hold the number of bins b 
fixed and change the number of samples, the naive esti- 
mator and the true value (which we would get if the num- 
ber of samples were infinite), differ by a bias that scales 
as We can use the equation as a prescription for 

getting rid of the bias: if our true dataset is of size iVtot, 
we can subsample the data at fractions of the total size, 
for example at iV = 0.77Vtot, O.STVtot, O.gA^tot, 0.957Vtot, 
many times, and compute an average naive estimator at 
each fraction of A'tot- With these estimators in hand. 



we can use linear extrapolation in from Eq (551 to 



obtain an unbiased estimate of information with b bins, 
/oo, b(c;g). What remains to be done, then, is to choose 
a correct number of bins for discretization. With too 
small a number, we will lose the structure in the joint 
distribution - e.g. if one only discretizes into 2 levels, for 
instance, fine scale details in P{gi,g2) of Fig. 20 might be 



lost. If one discretizes into too many bins, however, then 



the linear correction term in Eq ( 55 1 will no longer suffice 



to counter the sampling problems, and our estimates will 
be wrong. For a reference on direct estimation technique, 
consult Ref. (iSlonim et all 120051). 



Figure [21] documents the direct estimation procedure. 
We see that as the number of bins b grows, we capture 
more and more information in the data, and the slope of 
the extrapolation line [and thus finite-size corrections of 
Eq (55)] are increasing in size. If we used 6 3> 50 bins, 
the extrapolation would break down and we would lose 
control over information estimation. 

If the data were inherently discrete, then the technique 
is more straightforward: one only does the 1 /N extrapo- 
lation to correct for the sample size at the given number 
of bins that correspond to the number of discrete levels 
in the data. 

Finally, we can apply the estimation procedure to the 
MAP kinase signaling network dataset to extract the 
11 X 11 pairwise mutual- information matrix, summariz- 
ing the statistical dependencies between all pairs of ac- 
tivities gi, gj, see Fig. 22 Interestingly, for example, the 



mutual information analysis reveals that the pair (geiffs) 
has about 0.4 bits of mutual information, yet the pair- 
wise correlation coefhcient is only 0.019, signaling that 
most of the statistical dependency is not linear. 

There are many other ways to estimate mutual infor- 
mation differing in how they handle the small-sample 
bias, which is the biggest technical difficulty with esti- 
mations of this sort; the naive estimators are very quick, 
but resamplings necessary to handle the bias lengthen the 
computational time considerably. Nevertheless, mutual 
information has been used to compute pairwise similarity 
matrices b etween all pairs of ge nes in high throughput ex- 
periments (Slonim et al 2005), and overall, this statistic 



is gaining in popularity in life sciences. 
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FIG. 21 The direct mutual information estimation procedure. 
This is an example where we compute the information be- 
tween the activity level of signaling proteins gi and g2. At 
each number of bins, 6 = 5, 10, . . . , 50 (different lines), one 
subsamples repeatedly the whole dataset of 5400 samples into 
70%, 71%, . . . , 99% of the samples, 500 times. For each frac- 
tion of the data, 500 naive estimators /jv,b are thus computed; 
their mean and std are plotted as black points with error bars. 
For each fixed number of bins b, naive estimates at different 
fractions of the data A'^ fall onto a line when plotted against 
1/N. The linear extrapolations are shown in red, and the 
extrapolation is from the finite dataset into 1/N — >■ 0, i.e. to 
the limit of infinite data, which is the intercept of the red line 
with the vertical axis. As the number of bins is increased, 
we get higher and higher information estimates, until they 
start converging to the same value of I{gi; g2) ~ 1.33 bits, at 
b = 50 bins. 



C. Generalizations of mutual information 
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FIG. 22 A matrix of pairwise mutual information values 
I{gi;gj), in bits (color), between pairs of activation levels of 
signaling proteins in the MAP cascade. The direct estima- 
tion procedure was used, with the maximal number of bins 
b = 50, and extrapolation in Eq. 1 55 1 was used to correct for 
finite sample size effects. 



FIG. 23 A synthetic dataset where 50 samples for gi and 32 
(top two rows) are drawn independently with probability 0.5 
for being or 1, and g^ = XOR((;i, 32). If one looks at any 
pair {gi,gj), there is no pairwise dependence, because all 4 
combinations 00,01,10,11 happen equally often. However, 
when one looks at the triplet (ffi, 52, ffs), there is a clear de- 
terministic dependence of gs on the (?i,(?2. 



Before concluding this lecture, let us discuss two gen- 
eralizations of mutual information. The first generaliza- 
tion extends the measure to include higher-than-pairwise 
structure. Suppose that we have three interacting el- 
ements, gi, 52, and 33. There are 3 pairwise mutual 
informations that one can compute, I{gi;g2), I{gi',g3) 
and I{g2i93), describing pairwise statistical dependence. 
However, there might be statistical dependencies be- 
tween three elements of the network that no pairwise 
measure can detect. The simplest example can be con- 
structed if gi, (72 J 93 are binary variables, related by 
gs — X0R{gi,g2), where the binary function XOR re- 
turns 1 exactly when gi and 32 are different, and oth- 
erwise; see Fig[23l 

In order to detect this higher-order dependence, we 
need an information-theoretic measure that generalizes 
mutual information. This measure is called the multi- 



information, and is defined as follows: 

/({ff J) = E P(i9^}) log2 J}^^^ (56) 

where the numerator contains the joint distribution over 
M elements, gi, . . . , g^/i and the denominator contains 
the product of marginal distributions. Multi-information 
captures all statistical structure between pairs, triplets, 
up to the complete statistical correlation between all M 
elements. While powerful, this quantity is usually very 
hard to estimate because one would need to sample the 
full joint distribution over M elements. We can, how- 
ever, restrict ourselves to triplets of elements. For the 
synthetic XOR example we would find that the multi- 
information is 1 bit (because if (?2, 53 were completely 
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random, uncorrelated binary variables, they would have 
3 bits of entropy; however, only gi and (72 are randomly 
and independently drawn, while is a deterministic 
function of both, so the true entropy of the joint dis- 
tribution is only 2 bits; this reduction of 1 bit is the 
multi- information). In case of the MAP network data, 
the full multi- information would require the 11 dimen- 
sional joint distribution, but we could restrict ourselves 
to triplets {gi-,gj,gk) and ask about how much interde- 
pendency there is in such combinations. 

Up to now, we have been looking for statistical struc- 
ture among {gi} irrespective of the conditions, C. We 
could also ask about the mutual information between 
the activation level of protein i and condition C, that 
is, I{gi;C), using the estimation techniques already de- 
veloped. When we consider such condition (or stimulus) 
dependence, there is also a new set of statistical questions 
that we can ask. For two signaling proteins, gi and gj, 
we can compute separately I{gi]C) and I(gj;C), respec- 
tively. But we could also ask about I{{gi, gj};C): how 
much information does a pair (gi^gj) of two activation 
levels together tell us about the condition C. In principle, 
this can be more or less than the sum I{gi\C) + I{gj;C). 
We define the quantity R as: 



V. USING INFORMATION THEORY FOR INFERENCE: 
TRANSCRIPTION FACTOR - DNA INTERACTIONS 

In our toy example of transcriptional regulation we 
have been assuming that transcription factors bind to 
a well-defined "binding site" somewhere on the DNA. 
But what distinguishes the particular binding site ~ a se- 
quence of 10-20 nucleotides - from all other possible short 
sequences in the genome? How does the TF molecule find 
the correct binding site? 

There is a substantial amount of existing work address- 
ing each of the two questions which are stated more pre- 
cisely below: 

The "specificity problem" arises because the cor- 
rect statistical mechanics problem for TF binding is not 
only that of a single (specific) site being occupied or 
empty, as schematized in Fig. [3] Instead of a single 
site, there is really a large number M of sites on the 
genome, and they have a distribution of binding ener- 
gies p{E): most likely, the specific site is one of the best 
binders (having the most negative i?), but there might 
also be some non-functional sites with low energies as 
well as a huge number (millions for a prokaryote, or bil- 
lions for an eukaryote) of spurious non-specific sites that 
the TF molecule could bind weakly. Our partition func- 
R = I{g^\C) + /(.gj;C) - /({g„gj};C); (57) tion should reflect this: Z = J2fLo e"'^'^'-''), where the 

sum is taken across all the sites in the genome (including 
the specific site that we denote as having the energy Eq). 
The real question is as follows: how does the cell make 
sure that the TF spends most of the time occupying the 
(functional) binding site, and not sitting wastefuUy on a 
large number of non-functional traps? Clearly, the bind- 
ing energy to the specific site must be much stronger 
than to the non-specific sites, Eq ^ Ei,i 7^ 0. But 
the sites only differ in their sequence of nucleotides s, 
so there must exist an "energy function" E{s) such that 
E{so) ^ E{si),i ^ 0. What are the energy functions for 
real transcription factors? 

Tiie "searcli problem" arises when we realize that 
even if equilibrium occupancies were to work out and the 
specific site is occupied with larger probability than non- 
specific sites, there remains the question of the speed of 
equilibration. To equilibrate, TF molecules in the nu- 
cleus they must sample various sites on the genome and 
must therefore physically move from site to site on the 
DNA. Because the translocation of TFs is driven by ran- 
dom diffusion, this puts a computable upper bound on 
how quickly the sites can be sampled and how quickly 
the system can equilibrate. A lot of excitement was gen- 
erated when it was observed that some transcription fac- 
tors can find their sites faster than predicted given the 
3D diffusion limit; more complex modes of TF translo- 
cation were proposed, including sliding and hopping of 
transcription factor molecules along the ID contour of 
the DNA. It seems that a combination of ID and 3D dif- 
fusion can reconcile the measured rapid TF search times 
with the theoretical expectations ( Slutsky et al 2004 1 . 



when R is negative, the pair of activity levels together 
is more informative about the condition than both lev- 
els considered separately; {gi,gj) are said to be synergis- 
tic. Alternatively, when R is positive, gi and gj are not 
providing independent information about the condition 
- they are redundant. 

Synergy and redundancy are extensively used in neu- 
roscience to ask how groups of neurons together encode 
the stimulus (stimulus in neuroscience is analogous to 
conditions C in our example) , as compared to how single 
neurons on their own encode the stimulus. In our MAP 
kinase example, we find that pairs of activity levels pro- 
vide redundant information about the condition. Further 



details can be found in Ref ( Tkacik 2007 1 . 



D. Use and interpretation 

Information theoretic measures, such as mutual in- 
formation, multi-information, redundancy, synergy (and 
others that we don't discuss here, such as KuUback- 
Leibler divergence, Jensen-Shannon divergence etc) pro- 
vide a very powerful, assumption-free framework for dis- 
covering statistical dependencies in the data. There ex- 
ist systematic approaches that discover correlations, both 
linear and non-linear, between pairs of elements, triplets, 
quadruplets etc (Schneidman et al 2003); we are usually 



limited by the availability of data and run into sampling 
problems for such higher-order dependencies, but in prin- 
ciple they could be computed. 



Here we will focus on the first problem of specificity in 
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TF-DNA interactions. In particular, we will discuss how 
information theory can be used to infer the energy func- 
tion, E{s), for a given transcription factor whose binding 
has been probed in various high-throughput assays. 

Why is the problem of DNA-TF interaction important? 
If we want to ultimately understand genetic regulatory 
networks, we need to know which transcription factors 
bind where in the genome - in particular, which sets of 
genes they are regulating. The latter question can be 
answered if one knows the energy model for TF-DNA 
interaction and is in possession of the complete genome 
sequence. Whole genome sequences are available today, 
so the difficult part of this program is learning the model 
of TF-DNA interaction from various datasets, such as 
protein binding microarrays, chromatin immunoprecipi- 
tation assays, microarrays and high throughput sequenc- 
ing techniques. 



contributions are parametrized by the energy matrix Sab 
of dimension L x 4, where each entry in the matrix at 
position i specifies how much energy is contributed by a 
particular base b — s{a) at that position, as shown in 
Fig.p 

This independent energy matrix model is likely not lit- 
erally true, but the number of free parameters L x 3 can 
be small enough so that reasonable energy matrix mod- 
els can be fit from available data"'^*. Models that include 
higher-order contributions to the energy are more realis- 
tic, but the number of parameters explodes. Moreover, 
the simple model has had a number of successes in pre- 
dicting TF binding sites, so we adopt it here. 

B. Connection between energy and position weight 
matrices 



A. Energy matrices 

We start by giving a brief overview of how the inter- 
action between TF molecules and DNA has traditionally 
been described and inferred from data. We continue by 
pointing out the flaws in the traditional approach and 
show how it can generate biased models of TF-DNA 
interaction. We end by proposing a new information- 
theoretic inference method that can avoid these problems 
if enough data is available. 
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FIG. 24 A schematic representation of the interaction be- 
tween a transcription factor molecule (ellipse) and a site on 
the DNA of length L nucleotides and sequence s. The en- 
ergy of this interaction E{s) is given by an energy matrix 
of dimension i x 4, here represented as the matrix in the 
TF molecule. Each nucleotide s{a) — {A,C,T,G} in the se- 
quence contributes independently to the total binding energy, 
which is a linear function of sequence: E{s) = X]a=i ^as{a)- 
At each position a we look up the base at that position 
b = s{a) in the short sequence, then look up the correspond- 
ing entry in the energy matrix eat, and add it to the total 
binding energy. 



The simplest model of TF-DNA interaction is the 
so-called energy matrix model, shown in Fig. [24] 
In this model, TF binds short sequences s — 
{s(l), s(2), . . . , s(L)} of length L on the DNA. Each base 
pair in a short sequence, s{a) = {A, C, T, G}, contributes 
independently to the total binding energy. These energy 



When no high-throughput experiments were available, 
the primary data that could serve as input for inference 
of energy matrices were lists of known and experimen- 
tally verified binding sites. Frequently, these lists were 
incomplete, both because it was time-consuming to probe 
many candidates, and because the experimentalists pre- 
ferred to set stringent criteria for a "true" site and thus 
avoid the controversial issue of possible weak, but func- 
tional, sites. 

Suppose a list of M known binding sites, {si}, i = 
1, . . . , M, is given. Then, there exists a concise summary 
of TF sequence preference known as the position weight 
matrix, or PWM: 



1 



M 



i=l 



(58) 



that is, the PWM is simply a frequency table of how often 
a particular base b appears at position a = 1, . . . , L in 
the set of known binding sites. In a seminal paper, Berg 
and von Hippel have shown that under certain conditions 
there exists a simple relation between the PWM of a 
given transcription factor, and its energy matrix (Berg 
et al 



1987) 



Cab = - \0g Wab 



(59) 



up to the arbitrary energy offset in each row of the en- 
ergy matrix, and the overall unit of energy (scale factor) . 
Since this paper has appeared, the connection between 
PWM and energy matrix has become the cornerstone of 
inferring energy matrices: first, a list of known (putative) 
binding sites is produced, a PWM is extracted from it 
and the energy matrix is constructed using the equality in 



One can subtract a constant from each column of the matrix 
without affecting the binding because energies are significant 
only up to an overall additive constant factor. This is why the 
total number of free parameters is not L X 4, but L X 3 instead. 
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Eq ( 59 ) , and the list of putative binding sites is then usu- 



ally refined in some iterative procedure that involves ex- 
perimental data. Even when the experimental techniques 
progressed and the data was not restricted to short lists 
of verified binding sites, most inference procedures still 
relied on the Berg-von Hippel equality. Moreover, PWM 
slowly emerged as the relevant object that characterizes 
TF-DNA interaction, together with the picture that one 
must look for statistical signals in the promoter regions 
of the genes that signify sequences different from some 
null background expectation for what a non-functional 
sequence should look like. This "statistical view," which 
regards TPs as objects that look for "patterns in the se- 
quence," can be deceiving, if one forgets that TPs are 
not algorithms, but physical objects, and therefore must 
be described by physical quantities: energy of DNA-TF 
interaction is such a quantity, while a PWM is not. 

What are the assumptions that must be true in or- 
der for Eq (59) to hold? (i) The true binding sites are 



(a) 
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embedded into genomic background which is large and 
where bases are used independently at each site; (ii) The 
true binding sites have sequences that are as random as 
possible (maximum entropy), with the only constraint 
that the average binding energy of the sites in the known 
sites list is fixed, E. In other words, this assumption 
states that the only distinguishing feature of functional 
sites is that their binding energy is approximately E, pre- 
sumably lower than the other nonfunctional sites in the 
genome; (iii) The concentrations of TP are not such that 
some sites would be fully saturated; (iv) The known sites 
list is complete, or at least an unbiased sample of the true 
binding sites. 

What role do these assumptions play in combination 
with modern data sets when inferring TP energy ma- 



trices? Figure 25 shows data from one high-throughput 
experiment that can probe simultaneously the binding 
of a single transcription factor to all intergenic regions in 
yeast (see figure caption for details). Usually the analysis 
proceeds by thresholding the dataset to isolate sequences 
in which there are binding sites (true positives) and to 
minimize those sequences mingled in that do not have 
a binding site yet pass the threshold (false positives). 
Data above threshold is retained, while data below it is 
discarded. With the data above threshold in hand, some 
model is assumed that links the direct experimental read- 
out (e.g. the light intensity in the PBM chip) with the 
putative binding of the transcription factor somewhere 
in the corresponding intergenic region. Often, an initial 
guess will be made for the energy matrix, which will then 
be scanned across the sequences passing the cut to iden- 
tify possible 'hits' in those regions, i.e. sites of length 
L that score well with the assumed energy matrix. The 
predicted hits will give rise to predicted light intensity, 
which can be compared to the true intensity, and an up- 
date in the energy matrix guess can then be made. Most 
often, the iterative step will make use of Eq ( 59 ) to im- 
prove the guess of the energy matrix. 

Our motivation for devising a new method for inferring 




FIG. 25 Experimental results from PBM (protein-binding- 
microarray) experiments of Mukharjee et al (Mukherjee et 
lal) 120041). In PBM assay, intergenic double-stranded DNA 



of yeast was spotted onto the array. Transcription factor of 
interest, Abflp in this case, is introduced to the array and 
left to bind; there should be higher probability of binding 
somewhere in those intergenic regions that contain the actual 
Abflp binding sites. The TFs are fluorescently tagged. Each 
spot in the array, corresponding to one particular intergenic 
region in yeast, therefore provides a light intensity readout 
related to whether Abflp is bound somewhere in that region 
or not. In this plot, the histogram of log light intensities for 
every spot in the PBM chip is shown, together with the ex- 
perimentalist's stringent cutoff (green vertical line) that sep- 
arates true positives (intergenic regions with TP bound) from 
the rest; the cutoff was chosen to minimize the amount of 
false positives. 



TF-DNA interactions was based on the following obser- 
vations about the existing approaches: 

(i) The assumptions underlying the Berg-von 
Hippel equation that links PWMs v^rith energy 
matrices might not hold. 

In particular, the genomic background is not always 
simple, as has been amply demonstrated by failed at- 
tempts to identify transcription factor binding sites in 
Plasmodium falciparum, the malaria parasite; this para- 
site has non-coding regions with a lot of statistical struc- 
ture, including complicated repeats, and simple models of 
genomic background fail to capture these dependencies, 
leading to a failure in algorithms that exploit Eq ( 59 ) to 



search for binding sites (Elemento et al 2007). Moreover 



the concentration of TFs in the experiment may be satu- 
rating for some sites, which is problematic for the original 
Berg-von Hippel formulation, but has been addressed by, 
e.g. Ref (Djordjevic et al[[2003|. 



(ii) Discarding most of the experimental data 
(e.g. measurements that He below the threshold 
of light intensity in the PBM example) is waste- 
ful. 

The experimental observation that the TP does not 
bind in some intergenic region is in fact as informative as 
the observation that it does bind in certain other regions. 
In other words, these "negative" samples inform us about 
what the energy matrix cannot be (because with a wrong 
model, we could predict binding in those regions that in- 
deed do not show any binding experimentally), and thus 
are informative about the model as well. 

(iii) We don't know the "error model" of the 
experiment. For a principled, unbiased inference of 
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any model from the data (including TF energy mod- 
els), we would have to write down the likelihood of the 
observed data given the model, P(data| model). To do 
proper Bayesian inference, we would then maximize this 
likelihood with respect to the model parameters (e.g. the 
energy matrix) However, in most of the experiments, 
we have no idea about what form the error model - the 
probability P - has. That is because experiments such 
as binding arrays, microarrays etc usually involve many 
complicated biochemical and detection steps, such as hy- 
bridization, washing, sonication etc, and therefore there 
is no principled way of writing down the probability of 
raw experimental readout (i.e. light intensity) given that 
the TF is or is not bound. Without the knowledge of 
P and the subsequent inability to do proper Bayesian 
inference, most researchers have resorted to ad hoc ap- 
proaches, such as setting thresholds. Moreover, these 
ad hoc approaches usually contain many algorithmic de- 
tails, involving data normalization and representation 
(e.g. does one analyze light intensity, or log light inten- 
sity), that can strongly influence the results. We would 
like to reexamine the premise that unbiased inference is 
impossible without the explicit knowledge of P. 

(iv) Analysis of the same TF using different 
methods often gives inconsistent results. 
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FIG. 26 Comparison of experimental results about the bind 
ing of Abflp in yeast using two assays, protein binding mi 
croarrays (PBM) from Ref. ( |MukIierjee"etai|[2004| ), and cliro 



matin immunoprecipitation (ChIP chips) from Ref. ( Lee et al 



2002 1. In the bottom, the raw experimental results are shown 



with the thresholds determined by the experimentalists. The 
circles show the number of bound regions identified by the two 
experiments along with their intersection. Figure reproduced 
from Ref. (iKinney et all|2006l). 



talists to select regions that are true positives, the inter- 
section of regions declared bound by PBM and ChlP-chip 
experiments is rather low. While such inconsistent results 
are not rare in the field, it is not clear how to interpret 
the inconsistency: as the experimental imprecision, dif- 
ference in experimental conditions (including the state of 
the yeast culture), biased inference of the bound regions, 
biological noise, etc^^. 

(v) We would like to put proper error bars on 
any inferred model. 



As Fig. [26] shows, despite best efforts of the experimen- 



In a paper by Kinney et al (Kinney et al 2006), we 
argued that some of these problems are related: for in- 
stance, if we could compensate properly for our lack of 
knowledge about the error model, perhaps the inferred 
binding sites from different experiments could be made 
consistent; likewise, if we did not need to assume any- 
thing about the statistics of the background sequence, 
perhaps the existing experiments would reveal binding 
sites in Plasmodium parasite. In the next section, we 
present an information-theoretic approach to modeling 
transcription factor - DNA interactions that will address 
some of these concerns. 



C. Mutual-information based inference of TF energy 
matrices 

We start by representing data from a typical high- 
throughput experiment in a new form. Suppose that a 
high-throughput experiment probes sequence fragments 
Si- These fragments can be longer than the suspected 
binding site size, which we assume is of length L; for ex- 
ample, sequence fra gments Sj could be i ntergenic regions 
in yeast, as in Ref. ( Kinney et al[ 2006). We don't know 
where in these sequences the binding sites are, if they are 
present at all, or how many binding sites there might be 
for the TF of interest. 

The experiment provides us with a raw experimental 
readout that corresponds to each intergenic region. In 
protein-binding microarrays, this is the light intensity 
level that is correlated with the probability that a flu- 
orescently tagged TF is bound in that intergenic region. 
Similarly, we can use the ChIP arrays. But the frame- 
work is more flexible: any quantity, either continuous or 
discrete, that is experimentally accessible and is thought 
to correlate with the binding of TF of interest, may be 
used; this in principle includes a combination of such 
measured quantities. To be concrete, consider a set of mi- 
croarray experiments where expression levels of genes are 
probed at many different conditions. These results are 
then used as input to clustering, and genes are grouped 
into co-regulated clusters. For example, there is a group 



More precisely, we could include a prior and maximize the poste- 
rior, which would be the really correct way of doing things, but Generally, playing with the thresholds in order to maximize the 
this does not change the gist of our discussion. intersection does not work well either. 
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of genes that is determined by clustering to be in group 
G. We can now assign to every intergenic region a value 
0, if the corresponding downstream gene is not in Q, and 
value 1, if the corresponding gene is in Q. Instead of 
PBM or Chip experimental data, in this case we would 
use the result of the clustering partition as the variable 
(binary in this case) that correlates with the presence or 
absence of a TF binding site. 

Before we proceed, we will bin (or discretize) the data 
into a number of discrete bins, if the data is continuous. 
This discretization is in principle arbitrary, and for best 
results one should make a tradeoff between increasing the 
number of bins (to increase the resolution of the method) 
and the sample size (so that the number of samples in 
each bin is large enough). In PBM experiments, for in- 
stance, the log light intensity levels are discretized into 
50, 100, . . . , 200 bins (to show that the results will not 
depend on this discretization). The bins group together 
log light intensities that are similar. We will use this dis- 
cretized data to estimate mutual information; as shown 



in Section IV mutual information is insensitive to mono- 



tonic transformations of data, and this is exactly what 
we want - our inference will be insensitive to whether we 
use log light intensity, raw light intensity, or any nonlin- 
ear function thereof. We have also mentioned that in the 
limit of large enough data set, the way we discretize does 
not influence the computed mutual information, so our 
method is robust with respect to this choice^^. 

To summarize, the input for our analysis will be pairs 
of (sequence, data bin), where sequences may or may 
not contain binding sites for the TF of interest, and the 
data is any experimentally determined quantity that is 
though to have a statistical dependence with the pres- 
ence or absence of TF binding sites. In particular, af- 
ter the discretization, our analysis does not even require 
the raw data values any longer, just the data bin into 
which the data was assigned on discretization. Let us 
denote the input data V mathematically as K pairs in- 
dexed by i = 1, . . . , if pairs: V = {{si, z^)}, where Si are 
the sequences potentially containing the binding sites, 
and Zi are the (discrete) bins associated with those se- 
quences and summarizing the experimentally measured 
values that correlate with TF binding. 

We now explain the core of the argument on which 
our inference is based. We assume that the experimental 
results Zi are some unknown function of the sequence 
Si, but the statistical dependence between the two can 
only be established through binding energy E of TFs to 
sites in the sequence Si. Suppose that there are many 
subsequences in Si of length L, where the TF could bind. 
If I knew the correct energy matrix e for the TF, I could 
evaluate the energy of the TF to bind onto each site in 
Si. In the simplest case, I could declare that if any of 



these binding energies is below some threshold, the TF 
will bind in that intergenic region (at least once), and I 
would declare that intergenic region bound, = 1. If 
there is no such sequence of length L in Si , then I would 
declare the intergenic region unbound, Xi = 0^^. 



Formally, the argument we are making can be repre- 
sented as follows: 



Si ^ Xi 



fiE{s,e))^z,, 



(60) 



that is, the sequence determines the energy of binding 
(that depends on the energy matrix e which we would like 
to infer), and the energy determines whether the region is 
bound or not Xi = 0, 1; that alone determines the exper- 
imental data bin Zi reflecting the measurement. There 
is no other statistical dependence between the sequence 
and the experimental data, on average, than through the 
binding energy^^! 



With these remarks in mind, a typical representation 
of a dataset might look as shown in Fig. [27] 



In practice, we do need to worry about the small-sample effects, 
but in cases that will be discussed these can be addressed. 



The details of how to deal with multiple bound sites in the same 
region, or whether to use a hard threshold to declare a site being 
bound or a soft threshold where "being bound" i s a real num- 
ber denoting probability, can be looked up in Ref. ( [Kinney et al[ 
[2006 [ l and its supplement. Interestingly enough, if mutual infor- 
mation is used to infer the energy matrices e, the matrices will 
be largely independent of these assumptions. Here, concretely, a 
simple rule was used: apply e to any site of length L within the 
region Si. If the energy of binding is favorable, i.e. below thresh- 
old 0, declare that site bound, and declare the whole region 
bound, Xi = 1. If none of the energies is below the threshold, 
the region is unbound, Xi = 0. The matrix e can only be deter- 
mined up to a scale, i.e. Eif, and Ae^j,, where A is a positive scalar, 
are both equally good guesses (in this setup it can be shown that 
the experimental data cannot predict the absolute scale of the 
matrix in physical units of, say, joules). We can get rid of this 
arbitrariness by fixing the threshold for binding, © = 1. 
In each particular sequence, there might be other statistical de- 
pendencies, such as binding of other transcription factors, chro- 
matin influences etc, but those unobserved influences will act 
as noise in our inference. Since we are not assuming any noise 
model, they should not bias the inferred energy matrices. 
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K pairs (sequence, data) [K=5812 in the example], and 
using Eq. (54) to compute the information^^ . 



FIG. 27 An example dataset for inferring the TF energy ma- 
trix. Intergenic regions in yeast are ordered, i = 1, . . . , 5812. 
If we make a guess at the energy matrix e, we can decide 
whether each intergenic region is bound or not (see text), 
and assign Xi = 0, 1 to each region. In addition, each re- 
gion has produced some experimental readout, that has been 
discretized and so each region is assigned to one of the bins 
Zi = 1, . . . , 250 (maximum number of bins is 250 in this ex- 
ample). The table illustrates that there is some statistical 
dependence between the region being bound {xi — 1) and 
being assigned into a bin corresponding to e.g. higher light 
intensities on the PBM chip (higher corresponding bin num- 
ber Zi). This table can be constructed if some energy matrix 
e is assumed, because Xi is a function of sequences Si and e. 



If the experimental result depends only on binding, and 
the binding depends on the sequence through binding en- 
ergy alone as in Eq. ( 60 1 , we can formulate the following 
inference principle: 



e* = argmax^/ (xj(e, si); Zi) , 



(61) 



where e* is the energy matrix that we are looking for, 
and argmax is returns that e that maximizes the mutual 
information /. Let us now parse this equation in detail. 

We believe that the experimental results {zj} should 
be maximally dependent on whether the corresponding 
regions are bound or not, {xi}. To characterize the full 
statistical dependency without making any assumption 
about the probability distribution P(zi\xi), we compute 
the mutual information I{xi;zi). Remember that this 
independence of any error model was one of our ini- 
tial motivations. From a table like the one in Fig. [27) 
I{xi; Zi) can be directly computed, by simply accumulat- 
ing the joint probability distribution P{xi, Zi) across all 



Information I{xi;zi) can be computed for any choice 
of the energy matrix e. We should now search the space 
of all energy matrices (3 x L parameters) for that matrix 
e* that will maximize this information. While nontrivial, 
this search can be implemented using Metropolis Monte 
Carlo (MMC) methods. This search does not only yield 
the best matrix e*, but an ensemble - a solution set - 
of matrices, all of which explain the data almost equally 
well and yield the same value for information /. We will 
not go into the details here beyond stressing that since 
we end up with a set of good solutions, we can compute 
how well constrained the energy matrix elements eab are 
and put rigorous error bars onto them. 

How do the results look like for the yeast Abflp ex- 
ample inferred from PBM data? Figure [28] is reproduced 
from Ref. (Kinney et al 2006), and it shows the inferred 



energy matrix and the error bars for each of the ma- 
trix elements. We see that most of the energy matrix 
elements are constrained very well by the data; a pat- 
tern emerges where Abflp makes contact to the DNA in 
two regions, with small (but still significant) energy con- 
tributions from the basepairs between the two regions. 
These results are broadly consistent with, but more pre- 
cise than, previous energy models for Abflp. 

Figure [29] shows that the resulting energy matrices un- 
ambiguously split all intergenic regions into those that 
are bound and those that are not. More surprisingly, as 
part of our results (not as an assumption!), we also learn 
P{xi\zi), the probability that the site is bound given the 
experimental data bin, zf, this quantity is related to the 
error model P(zi\xi), which we did not want to assume 
a priori. This curve has a sigmoidal shape, showing that 
no single hard threshold (as has traditionally been done) 
will perfectly separate regions that are bound from those 
that are not. This finding also has biological implica- 
tion - it is saying that there are both strong and weak 
binding sites, and some of the weakly bound sites are 
mixed into experimentally determined bins with regions 
that truly don't contain binding sites. Lastly, Fig. 29; 
shows that our inferred binding energies are conserved 
across two yeast species despite significant difference in 
aligned sequences due to evolutionary distance. 

We find many more binding sites than the number in- 
ferred by Ref. (Mukherjee et al 2004), where they set a 



very strict experimental threshold to avoid false positives. 
Interestingly, we can run exactly the same analysis on the 
Chip data by Lee et al (Lee et al [2002 ); we find that the 
inferred binding sites from two different experimental as- 
says performed by two separate experimental groups can 
be made consistent, unlike the discrepancy observed in 
direct region comparison, see Fig. 30 We think that 



This can be corrected for sampling errors, as discussed in Sec- 
tion HVl 
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FIG. 28 Results for Abfl p energy matrix in yea st inferred 
from PBM data fro m Ref. (|Mukherjee et al| [20041 ); figure re- 
produced from Ref. (Kinney et al 20061. Panel a) shows the 



energy matrix eat, with hotter colors indicating larger energy 
matrix contributions, and darker colors smaller contributions. 
Panel b) shows the error bars on each of the energy matrix 
elements, ctj^^. Panel c) plots the matrix elements vs their 
errorbars (y axis), and also shows in the two example insets, 
the distribution of values for two particular energy matrix 
elements in the solution ensemble found by MMC optimiza- 
tion. Here we looked for an energy matrix of length L — 20 
basepairs. The search can be repeated for L = 14, 16, 18 to 
show that the core elements remain unchanged, and the same 
energy matrix is found each time. The inference can also be 
performed by splitting the total dataset (5812 sites) into ran- 
dom halves, and showing that on each half consistent energy 
matrices are found, so that there is no overfitting [not shown, 
c.f. Ref. ([Kinney et all|2006|)]. 



this finding illustrates that proper inference can lead to 
more complete and consistent identification of binding 
sites, including the weak ones. In conclusion, let's com- 
ment on why mutual-information inference was able to 
provide us with good energy matrix models despite our 
inability to write down the likelihood (or error model) 
P(data|model). The answer lies in the observation that 
with enough data, one can simultaneously infer this error 
model along with the energy matrix. There is a formal 
way to show the connection between mutual information 
inference of Eq. (61 1 and the Bayes maximum likelihood 
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FIG. 29 Further results for Abflp energy matrix in yeast 
inferred from PBM data fro m Ref. (|Mukherjee et al| |2004[ |; 



figure reproduced from Ref. (Kinney et al 20061. Panel a) 
shows how K = 5812 intergenic regions of yeast are catego- 
rized as bound 2:^ = 1 or unbound a;; = by energy ma- 
trices in the solution set. Plotted is the histogram of the 
"hit fraction", (xi)^, i.e. the average value of Xi for each re- 
gion across all solution energy matrices. The regions clearly 
separate into two classes: most of the regions are declared 
unbound (HF~ 0) by all energy matrices (the histogram bar 
corresponding to HF= would extend beyond 1500 counts 
in the plot, but is cut for clarity), and a smaller set of re- 
gions that is declared bound (HF~ 1), by most of the energy 
matrices. Panel b) shows the histogram of raw experimental 
data in blue, the experimentalists' threshold (green line), and 
the inferred P{xi\zi) (sigmoidally shaped scatterplot with er- 
rorbars). The sigmoidal shape has not been assumed, but is 
the result of our inference. Panel c) shows the energies as- 
signed to sites of length L in every intergenic region in S. 
cerevisiae (the yeast species on which the analysis was run) 
vs the energies assigned to orthologous sites in a related yeast 
species, S bayanus. Those sites that are declared bound (be- 
low threshold O = 1) in 5 cerevisiae also have correspondingly 
low energies in the other yeast species, forming an island of 
points in the lower left corner of the plot. For sites above the 
threshold which are non-functional, the correlation between 
the energies in the two species is much weaker. 



inference, and we point the reader to details of Ref. (Kin- 
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FIG. 30 Lower row: raw data from PBM and ChIP exper- 
iments on Abflp, along with experimental thresholds. An 
intersection of regions that pass the threshold in both ex- 
periments shows large inconsistencies, as shown in Panel b). 
Panel a) shows the intersection of binding sites inferred using 
the mutual-information method presented here. More sites 
are found, and the sites found in the ChIP experiment are 
almost a perfect subset of the PBM-identified sites. Figure 
reproduced from Ref. (Kinney et al 20061. 



ney et al 2006 1 . The key point is that one can formulate 



the problem as maximum likelihood inference using an 
unknown error model P{zi\xi) and then average over all 
such error models with some prior; in that case one ob- 
tains an "error model averaged" log likelihood log Pema 
for the data, and this turns to be directly related to the 
mutual information of Eq (61): log Pema = KI{zi]Xi)+ 
(irrelevant terms). Thus a mathematical connection can 
be established between mutual information and Baycsian 
inference. 

To summarize, a new method for inferring TF energy 
models from a wide variety of experimental data has 
been proposed and shown to bring various existing ex- 
periments into concordance. High-throughput datasets 
provide enough data - if all data is indeed used for infer- 
rence and not ignored by arbitrary thresholds - to swamp 
the uncertainty introduced because of our ignorance of 
the real error model. It can be shown that this inference 
also works well when genomic background is complicated, 
as in Ref. (Elemento et al 2007); that reference also pro- 



vides an online tool which uses the same framework to 
learn many TF energy models simultaneously across the 
genome. Mutual information inference procedure does 
not rely on the relation between the position weight ma- 
trix (PWM) and the energy matrix, and thus does not 
require the validity of assumptions underlying the Berg 
& von Hippel argument (Berg et al 1987). 



D. Probing combinatorial regulation 

We conclude thi s lecture by br i efly re viewing the work 
of Kinney et al (Kinney et al 2010), that has com- 



throughput deep-sequencing based approach, in order to 
(i) precisely quantify the contribution of each basepair 
in the regulatory sequence to the function; and (ii) build 
detailed biophysical models of combinatorial regulation. 



In Ref. (Kinney et al 2010), the authors decided to 



reexamine the regulatory region of Lac operon, where 
both the CRP transcriptional activator, and the RNAP 
polymerase bind to regulate the expression of lac genes; 
see Fig. |9] 




mutated region 



FIG. 31 The regulatory sequence of the lac operon, with bind- 
ing sites for the CRP transcription factor and the RNA poly- 
merase, reproduced from Ref. (Kinney et al 20101. To probe 



bined the mutual-information inference with a new high- 



precisely how the sequence impacts function, a plasmid library 
was engineered in which mutagenized versions of the regula- 
tory sequence (bright green) drove the expression of GFP. E 
coli were transformed with the plasmids, the expression of the 
GFP was induced, and the fluorescence could be measured at 
the single cell level by FACS. 



The key to the experiment was to create a large library 
of plasmids that differ only in that the regulatory se- 
quence of interest has been mutated, see Fig. [31] this reg- 
ulatory sequence controlled the expression level of GFP, 
which could easily be recorded in the experiment. The 
cells with different regulatory sequences Si drove various 
levels of fluorescence, that the experimenters sorted into 
9 bins, Zi. Despite being seemingly very different from 
the PBM / Chip setups described in the previous section, 
here too one is working with a large number {K ~ 5 • 10^ 
per experiment) of pairs of sequences and the experimen- 
tal readouts, and an almost identical inference technique 
can be applied. 

First, one may ask directly about the mutual infor- 
mation I{bi] z), about the identity of the base pair ha = 
{A, C, T, G} at position a = 1, . . . , 70 in the mutated reg- 
ulatory sequence, about the expression level z of GFP, as 
measured by fluorescence activated cell sorting (FACS). 
This is a direct measure, without making any modeling 
assumptions, about how each base pair on its own affects 
expression; see Fig. [32j 

As Fig. [32] shows, the method yields extremely precise 
information footprints that characterize the functional 
impact of every nucleotide on the promoter activity, us- 
ing the mutual information measure. This method is 
clearly applicable to unknown regulatory regions if one 
wishes to quantitatively determine which nucleotides are 
functionally important. 

Finally, it is possible to use the mutual-information 
inference to learn the energy matrices of the CRP and 
RNAP, ecRp and e/jjv^p, and the possible energetic in- 
teraction between the two proteins. Here, the interac- 
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FIG. 32 Information footprints of CRP and RNAP binding to 



DNA in the lac promoter, reproduced from Ref. ( Kinney et al 
2010 1. The blue and violet regions on the sequence (green) de- 



note where RNAP and CRP, respectively, are thought to con- 
tact the DNA. Panel a) shows the case where CRP is bound. 
Below the sequence, the information /(feo,; z) is shown, in bits, 
about the identity of a single nucleotide and the final GRP 
fluorescence, together with the error bars. In the third row we 
see the zoom-in of the small information values. The method 
yields precise results with very small error bars; the results 
are broadly consistent with what is known, but also show 
that some positions carry small, but significant information 
about promoter activity. Panel b) shows the same analysis 
performed when CRP is not bound. Within error bars, most 
of the positions that are informative about expression with 
CRP bound are now 0, as expected. 



tion even helps in constraining the absolute energy scale 
of the energy models^^, so the binding energies can be 
expressed in physical units, as shown in Fig. |33[ Using 
a convincing series of information-theoretic arguments, 
the authors can compare how much information the se- 
quence gives about the promoter activity /(s; z) (this is 
computable directly from the data without any model). 



It can be shown mathematically that the interaction energy term 
breaks the arbitrariness about the scale of the energy matrices 
present when single TF binding is analyzed. 



with the fraction of that information that is captured 
by any particular model ~ this can inform us how good 
any model is with respect to reality. They show that 
the thermodynamic model of combinatorial regulation, 
developed in line with the reasoning of Section [nj does 
an excellent job of accounting for the data, and that the 
models inferred using mutual-information inference out- 
perform the models constructed using the Berg & von 
Hippel relation in Eq. (59). For many other interesting 



details an d controls we refer the reader to the original 
reference (Kinney et al, 2010). 
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FIG. 33 Energy matrix for CRP binding in panel a) and 
for RNAP in b), when probed separately using mutual- 
information inference. Panel c) shows the joint inference of 
both energy matrices together with the energy of interaction, 
Ei. The matrices in c) are consistent with the separately 
learned matrices in a,b). Due to the presence of coopera- 
tive interaction e^, it is possible to put absolute units on all 
energies. Figure reproduced from Ref. (Kinney et al 20101. 



Hopefully these examples present a strong case for the 
usefulness of information-theoretic measures and meth- 
ods, and demonstrate that inference should develop in 
step with advances in experimental techniques. In partic- 
ular, the examples highlight the difference between tra- 
ditional physics experiments in which the instruments 
can be calibrated and understood well enough for us to 
obtain a handle on P(data| model), and quantitative bi- 
ology, when such understanding is often lacking. If the 
latter case and when using biased or ad hoc inference 
methods, it is not clear that a larger dataset would actu- 
ally lead to better models. On the other hand, principled 
methods can give a very detailed, quantitative and physi- 
cal account of what is happening in the regulatory regions 
of the genome. The price for this performance is large 
required amount of data and computational time, but as 
the field progresses, those factors are becoming less and 
less important as practical constraints. 
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E. Relation to neuroscience 

A very similar method has been devised to probe the 
behavior of sensory neurons. As we mentioned in Sec- 
tion [nj sensory neurons are often described in terms of 
a hnear-nonlinear (LN) model: a stimulus, e.g. a movie 
displayed to a retinal ganglion cell, is convolved first with 
a linear kernel £ that parametrizes the preference of the 
neuron for some linear feature (dimension) in a normally 
highly-dimensional stimulus. The result of this convo- 
lution is then passed through a nonlinear function that 
yields the probability rate for generating a spike. As 
discussed, various methods have been developed to infer 
the linear filter £ from traditional experimental setups, 
in which a large number of stimulus snippets Si are pre- 
sented, and the spike/no-spike Zi is recorded. Note the 
analogy with gene regulation; we have stimulus of high 
dimensionality (sequence or a movie), experimental out- 
put (GFP level or spikes), we are looking for a linear 
function on the stimulus space (energy matrix e or linear 
kernel £), and we don't know the probabilistic model of 
how the output is generated given the product of stimulus 
with the kernel (or sequence with the energy matrix). 

If we knew the nonlinearity that the neuron imple- 
ments and its likelihood function for determining when 
it spikes, we could infer the linear kernel C by writing 
down the probability of data given the model, and do 
Bayesian inference of the model. But just as in the case 
of transcriptional regulation, these quantities too are of- 
ten unknown. A method called maximally informative 
dimensions that finds the linear kernel C by maximizing 
the mutual information between the stimulus projection 
and the spike trains has been devel oped and successfull y 
applied in several sensory systems (Sharpee et al, 2004). 



VI. RECONSTRUCTING BIOLOGICAL NETWORKS 
USING THE MAXIMUM ENTROPY PRINCIPLE 

One of the most pressing questions in systems biol- 
ogy today deals with deciphering the structure of regu- 
latory networks from data. The traditional way in which 
such networks were dissected was through genetic exper- 
iments, where painstaking experimentation and muta- 
genic studies helped deconstruct the networks one link 
at a time. With the advent of high-throughput exper- 
iments, such as microarrays, ChIP and protein binding 
arrays, as well as simultaneous stains and fluorescent pro- 
tein fusions etc, the need arose for computational tools 
that would be able to infer networks from such datasets 
directly. 

Apart from being high-throughput, some of these tech- 
niques have opened up another window: instead of look- 
ing at bunches of cells mixed together (like in microar- 
rays), they are enabling experimenters to record simul- 
taneous expression or activation levels of the nodes of 
a single biological network, without pooling over many 
"copies" of such networks (e.g. extracted from differ- 



ent cells). In physics terms, this means that not only 
are the mean activation or expression levels indicative 
of the network activity, but so are the correlated fluctu- 
ations among the nodes. To make use of this fact, we 
are looking for physical models that include modeling of 
the (correlated) noise. The structure of fluctuations can 
potentially tell us a lot about the wiring diagram of the 
system. 

The simplest - and still most widely used - approaches 
for detecting network structure rely on studying correla- 
tions directly. In a typical microarray experiment, the 
cell cultures are exposed to various external perturba- 
tions and the mRNA levels for genes across perturbations 
are recorded. Then a pairwise correlation matrix is com- 
puted across the perturbations, yielding a, N x N pair- 
wise similarity matrix between N genes (generalizations 
are possible by measuring mutual information instead of 
correlation coefficients). Such a similarity matrix can 
then be used as an input to, for example, clustering. 

Clustering is one of the simplest and most scalable 
methods of understanding the collective behavior of a 
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network. Consider the information matrix of Fig, 
as a matrix of weights between the nodes of a graph: 
the graph has strongly connected components that cor- 
respond to clusters (blocks on the diagonal of the infor- 
mation matrix) and these blocks are weakly coupled to 
other blocks. One might even threshold the information 
matrix and draw binary links in the graph whenever the 
similarity measure exceeds the threshold value, and some 
researchers have indeed taken this approach. 

Clustering turns out to be an extremely powerful ap- 
proach for several reasons. Firstly, in gene regulation 
we know that out of the whole set of genes, the total 
number of genes that regulate other genes - so called 
transcription factors ~ is on the order of a few percent. 
Although this, much smaller, group of genes with regula- 
tory power could conspire combinatorially and still regu- 
late every other gene in a complicated individual fashion, 
many genes need to be up- or down-regulated together, 
because they act as enzymes in connected reaction path- 
ways or they need to be active in a specific tissue. This 
coregulation is the basis for the success of the clustering 
approach: coregulated genes cluster and cluster mem- 
bers are assumed to be regulated in identical ways by 
their (one or few) transcription factor(s). 

Although clustering is clearly productive as a first step 
in understanding genetic regulatory networks, it is not a 
generative model of the network. It reorders the nodes 
so that the structure (hopefully) becomes apparent, but 
does not give any prescription about how the activity of 
one gene influences the activity of the others - the only 
input to the clustering procedure is the mutual informa- 
tion, and we explicitly stated that information measures 
dependency without revealing anything about underlying 
functional relationships. Moreover, as we will soon see, 
understanding that network elements Ui and aj are cor- 
related, which is the basis of clustering, tells us nothing 
about whether Ui is really directly influencing Uj ; in par- 
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ticular, in gene regulation, the genes are coregulated and 
are therefore coexpressed, and correlation does not imply 
causation or direct interaction. Despite being very prac- 
tical, clustering leaves too many questions unanswered if 
we want to understand network behavior. 



A. Correlations vs interactions 

Can we disentangle the mesh of correlations and sep- 
arate the correlations caused by real underlying interac- 
tions from the correlations induced indirectly by other 
interactions, as is illustrated in Fig. [34| ? 

To start, we recall a classic problem in statistical 
physics: we are given a lattice of Ising spins (binary vari- 
ables), and some specification of exchange couplings (in- 
teractions) - perhaps between nearest neighbor only - 
and the exercise requires us to find the equilibrium cor- 
relation function between the spins, i.e. (aiai+A)- In our 
case however, we will be dealing with network "reverse 
engineering." The exchange interactions themselves will 
be unknown, yet we will observe a mesh of correlations. 
The problem will then be to compute the exchange in- 
teractions from the measured correlations, with the hope 
of finding a network defined by the interactions to be 
simpler (for instance sparser) than the network of corre- 
lations. 

Let us formulate the problem more precisely. The 
network consists of N nodes with activities ai, i = 
l,...,iV, which, we will for now assume, can take on 
only two values, tXi G { — 1,1}. An experimental snap- 
shot of the network activity is then described by a vec- 
tor, d = {ai,a2, ■ ■ ■ jCTn}. Our data consist of patterns 
I? = {(7^, (T^, . . . , (T"^}, i.e. there are a total of T simulta- 
neous measurements of the activities at all nodes, while 
the network is in some stationary state. These samples 
can be thought of as "instantaneous" snapshots of the 
system or, in simulation, draws made during a Monte 
Carlo sampling run. From the samples we can estimate 
the moments of {(7^} at successively increasing orders: 
first order moments are N mean activity values, (iTi); 
second moments are N{N — l)/2 correlations, ((TitTj); 
and so on. Because the system is noisy, there will be 
fluctuations around the stationary state and not all T 
patterns are going to be equal. We expect some patterns 
to be more likely than the others, and the full descrip- 
tion of the system in equilibrium must be contained in 
a joint probability distribution, p((Ti, CT2, . . . , <jn)- Get- 
ting a handle on this distribution is therefore our final 
goal, and as we will soon discover, computing successive 
approximations to it will give us the desired interactions 
that underlie the observed correlations. 

Except for a very small number of network nodes there 
is no hope of directly sampling the distribution from the 
data. Its size grows exponentially in N and for a modest 
network of 10 binary nodes we would generally need to 
estimate 2^" ~ 1000 parameters. To proceed, we clearly 
need a simplifying principle. 




FIG. 34 A small three-element section from a network of in- 
teracting nodes. Suppose that ai modulates the activity of 
CTj and (jfe through some microscopic mechanism (denoted by 
thick lines) . We can expect to observe strong correlations be- 
tween ai and aj, and between ai and ak due to this direct 
influence. On the contrary, aj and at are not directly cou- 
pled, but can still show signiflcant correlation (dashed line) 
because of common control by ai . 



A commonly used procedure is called Bayesian net- 

20041), and it is a 



work reconstruction (Friedman et al 



method from the more general class of graphical mod- 
els. One starts by assuming a specific (initial) factoriza- 
tion of the joint probability distribution over all nodes 
and represents it as a graph Q'^, as in Fig 35 Remem- 



bering that the activities are discrete variables, all con- 
ditional distributions in the factorization can be repre- 
sented as probability tables with unknown entries that 
need to be fit from the data. Such fitting procedure 
can be performed in many ways, and one can evaluate 
the likelihood of the fit £(tj")^^. Of course, we have no 
prior knowledge of what the correct graph factorization 
of the initial distribution is, therefore a procedure is de- 
vised that wanders in the space of possible graph topolo- 
gies and tries a likelihood on each, producing a sequence 
{£(^^0), . . . , £(g"), £(^^"+1), . . .}23. The complexity of 
each graph, e.g. the number of links, is penalized and 
combined together with the fit likelihood into a scoring 
function. The goal is to find the factorization of the prob- 
ability distribution with the best score. Presumably, we 
will then have discovered a simple graph that fits the 
data well. 

There have been successful network reconstructions us- 



ing this approach ( Sachs et al 2005 ) . The key simplifying 



assumption that makes this approach feasible is that the 
graph of interactions is sparse, i.e. that there are many 
fewer real than potential interactions. Given such spar- 
sity, the factorized probability distribution will have a 
far smaller number of unknown parameters than the full 
joint distribution, and there will be reasonable hope of fit- 



Bayesian network reconstruction is an iterative procedure, and 
at n-th step, we are considering graph Q", hence the index. 
This will usually be some sort of gradient descent or simulated 
annealing procedure. 
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ting them from the data. The method allows interactions 
of arbitrary complexity (as many arrows converging on a 
single node as possible), but has some drawbacks. Firstly, 
there is an exploding number of graph topologies over N 
elements, and no hope in exhaustively trying all of them; 
whatever algorithm one devises to explore the space of 
topologies, it can get stuck in local extrema of the scor- 
ing function. Secondly, due to computational constraints 
not all kinds of graphs can be explored - usually one has 
to exclude loops and this is a big handicap for biologi- 
cal systems where feedback plays a very important role. 
Finally, because we are looking for a tradeoff between 
the best likelihood fit and the simplicity of the model, 
we have to (arbitrarily) decide how to penalize complex 
topologies. It is not a priori clear that one should simply 
minimize the number of links and disregard other features 
of the graph. In particular, we expect that for systems, 
in which collective effects are driven by the presence of 
weak interactions between lots of pairs, Bayesian method 
will perform poorly. 
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FIG. 35 Bayesian factoring of the probability distribution 
over five nodes. The edges imply conditional dependence, and 
if two nodes are not joined by an edge, they are assumed to be 
conditionally independent. This example graph Q implies that 
the joint probability distribution can be written as follows: 
p(cri, . . . , (Ts) =p(cri)p(a4)p((75)p((T3|cri)p(CT2|o-i, 0-4,0-5). This 
form has 1-1-1-1-1-1-2-1-8 = 13 free parameters in case of 
binary variables (remember that conditional probability dis- 
tributions are normalized) , and is much simpler than the com- 
pletely arbitrary p(o-i, . . . , 0-5), that for 5 binary nodes would 
have 2^ — 1 free parameters. 



B. Maximum-entropy models: general introduction 

Here we will try to take a radically different route to 
the solution, motivated by inverse problems in statistical 
mechanics. This methodology has been applied success- 
fully to a diverse set of biological interacting networks, 
such as neurons, human immune system, and signaling 



networks ( Schneidman et al 


2006 


Tkacik, Schneidman et 


|al';2009"'Mora et al 2010P; the ful 


1 analysis of the dataset 



We start with the realization that with a limited num- 
ber of samples, T, we can successfully estimate several 
lowest-order moments of the sought-for joint distribution 
p((Ti, . . . , ctat) that generated the data, for example, the 
means (tTi) and covariances Cij = (aiaj) — {ai){uj), or, 
in general, a set of mean values of some of the statistics 
of the unknown distribution, also called its "operators," 
(0^((t)) (which can be arbitrary functions of {ci}). For 
any reasonable choice of the operators there is an infi- 
nite number of joint distributions over A'' elements with 
the same mean operator values. Nevertheless, there is 
only one distribution that also has maximum entropy, 
i.e. there is one distribution that is as random as possi- 
ble but still satisfies those statistics that have been mea- 
sured in an experiment. This is the distribution that we 
would like to find, and the maximum entropy principle 
embodies the idea that any structure (or constraint) in 
the distribution has to be induced by the measurement 
(and not by explicit or hidden assumptions on our part). 
In other words, we will approximate the true distribu- 
tion p{ai, . . . ,<Jn) that we cannot measure in full (but 
can estimate some of its statistics Op), with the maxi- 
mum entropy distribution that is as random as possible 
but matches the true distribution in the values of all of 
the measured statistics. 

Formally, we are looking for the extremum of the fol- 
lowing functional: 



dap{a) log2p(CT) 



(62) 



^9f^J dap{a)dp{a) - A J dap{a) 



The first term is the entropy of the distribution, and there 
are /i constraints enforced by their Lagrange multipliers 



(0,,(a))p(5) = (o^; 



cxpt D 1 



(63) 



such that the average values of the operators over the 
sought-after distribution p{(j) are equal to the averages 
over data patterns, T) = {ct^, • . . ■ The Lagrange mul- 
tiplier A enforces the normalization of the distribution. 
It is easy to take the variation in Eq ([62]) and write the 
explicit form for the maximum entropy solution: 



p{a) 



exp 



(64) 



presented here can be found in Ref (Tkacik 2007). 



We call Eq ( 64 ) the maximum entropy distribution with 
constraints {O^). 

Operators that constrain the distribution can be arbi- 
trary, but we can gain further insight by restricting our- 
selves to the moments of increasing orders (the activity 
variables are still binary for simplicity). If one chooses 
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= cr^, then the mean values, ((Ti), are constrained, 
and the maximum entropy distribution is the factor dis- 
tribution: 



Z 



= 2cosh(.g^) 



(65) 
(66) 



This factor distribution is easy to compute, but it does 
not include any interactions - each element behaves in an 
independent fashion (similar to the mean-field theories in 
physics). We could continue constraining the maximum 
entropy distribution with correlation functions of higher 
and higher orders. If we were to fix both mean values 
and two-point correlations, the resulting distribution, Eq 
(64), would have an Ising form: 



this is the simplest distribution that contains pairwise 
interactions between the network elements. 

Constraining the three-point correlations would induce 
a new term in the exponent of the form J2ijk JijkO'i^j'^k- 
There is clearly a "ladder," where higher and higher or- 
der constraints are imposed on the distribution, and as 
a result, better and better maximum entropy approx- 
imations are constructed. Let us call, then, p'^'^^a) a 
maximum entropy distribution consistent with correla- 
tions of order k and smaller, in line with our notation for 
the factor distribution, p(^)(tT). In an iV-body system, 
the highest order of correlation is N, and p^^-* (ct) must 
therefore be the exact joint distribution - at this order 
our approximation is the exact solution, with entropy 
equal to S[p{a)]. In Ref (Schneidman et al 2003) it has 



been shown that this sequence of ever better maximum 
entropy approximations defines a unique decomposition 
of multi- information of Eq (56): 



N 



I[p{a)] = 



(68) 



k=2 



j{k) ^ ^[p('=-i)(^)] -^[pC^n^)]- (69) 

In words, the connected information of order k, is 
the difference of the entropies of the maximum entropy 
distribution consistent with correlations of order fc — 1 
and one higher order. For example, connected informa- 
tion of the second order is the reduction of the entropy 
due to pairwise interactions; one creates the best factor 
(independent) model for the data and the best pairwise 
(two-body Ising) model for the data, and compares their 
entropies to see how much of the total structure in the 
joint distribution has been explained by purely pairwise 
terms. 



C. Maximum-entropy models: pairwise interactions 

How do we use this framework to model real networks? 
Once we collected the measured correlations, we would 



postulate the maximum entropy model of Eq (64) and 
solve the equations that determine all couplings, Eq (63); 
mathematically, we need to find {/i^, Jij} that solve the 
following equations: 



p(o-i, ■ 


■ • jCTat) 




dZ 




dhi 




dZ 







1 



z 



(70) 
(71) 
(72) 



where the expectation values on the right-hand side are 
measured in the dataset V. 

This procedure yields two important results: (i) since 
we have a generative model of the data, i.e. the proba- 
bility distribution, we can calculate and predict any ex- 
pectation value (especially of those statistics that were 
not used as constraints), and compare it to experiment; 
(ii) we can examine the couplings {Jij}, conjugate to the 
constrained operators, and interpret these as interactions 
that cause and explain the observed correlations. 

As is done in Bayesian network reconstruction, once 
we have computed the couplings, we can draw a graph- 
ical model of the network with a link for each nonzero 
coupling Jij connecting the elements ct^ and ctj^^. These 
weighted links are undirected as there is generally no way 
of determining the "direction" of the interactions from an 
equilibrium model. 

Assumptions underlying maximum entropy recon- 
struction are quite different from its Bayesian relative: 
whereas in the latter case we assume sparse a network 
of (arbitrarily complex) interactions, we assume an ar- 
bitrarily dense network of simple (low order, e.g. pair- 
wise or triplet) interactions in the former case. To ex- 
plain all N{N — l)/2 pairwise correlations one needs the 
full matrix of N{N — l)/2 exchange couplings Jij^^, and 
therefore no discrete topology on the graph is assumed 
a priori. There is hence no problem of searching and 
scoring the space of topologies, no exclusion of graphs 
that include loops, and reduced dependence on the im- 
plementation details of the algorithm. The drawback is 
the ab initio exclusion of complex irreducible interactions 
between many nodes. Clearly, the real question to ask is 
about the approximation regime that is more suitable to 
biological systems, if a general answer exists at all. 

In practice, unfortunately, the maximum entropy net- 
work reconstruction is made difhcult by two problems. 
One is technical - solving coupling Eqs (63) is very hard. 



Therefore, for instance, the graphical decomposition of the prob- 
ability distribution plotted in Fig |35| would correspond to the 
distribution p = exp{—H)/Z, where H = hicri + Jiacicra + 
■^1450"! ''■40"5 in the maximum entropy picture. 
For higher orders, there is similarly no restriction on the structure 
of, for example, three-point interactions, Jijk- 
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In essence, one needs to solve 



D. Reconstructing a biochemical network 



d\ogZ{{g,}) 



/i / cxpt D 1 



(73) 



where Z is the partition function of the maximum en- 
tropy distribution in Eq ( |64[ ). This set of equations is 
both nonlinear in couplings g and requires the evaluation 
of the partition function, Z({gi,}), or effectively a com- 
plete solution of the statistical mechanics problem. The 
other problem concerns the identification of the nodes 
that are observed in the experiment. First, one will usu- 
ally be able to take measurements of only a small subset 
of the nodes comprising the network and we need to be 
concerned about how the hidden nodes influence models 
of visible nodes. Second, even if all nodes were identified, 
there is an issue of "coarse-graining." Is a node with two 
states really an elementary, physical object that only has 
two states (a protein with two phosphorylation states), 
or is it in itself a complex with many states, but for which 
a two-state model might (or not) be a valid approxima- 
tion? We do not have time to systematically address 
these issues in the lecture notes, but do wish to point 
them out. 
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FIG. 36 A diagram of MAP signaling network in human 
CD4 T-cells, reproduced and simplified from Ref ( [Sachs e"t| 
[al 20051. Phosphorylation level of 11 white nodes was ob- 
served; red and green numbers indicate points of intervention 
(i.e. the change of external conditions C in which the network 
operates). These chemical interventions change the state of 
the whole network by locking the activity of the nodes on 
which they act into activated (green) or deactivated (red) 
state. Chemicals 0, 1 and 2 represent naturally occurring 
stimulatory agents; and 1 are present in all C, while 2 is 
present in C = 2. The arrows represent experimentally ver- 
ified chemical interactions; there are a number of known in- 
teractions through intermediaries that are known, but not 
plotted. Gray nodes were not observed in the experiment. 



Here we present a maximum-entropy-based approach 
to biochemical network reconstruction following the steps 
outlined in previously. We tackle these questions on the 
set of 11 interacting proteins and phospholipids (jointly 
referred to as biomolecules here) in a signaling net- 
work of human primary immune system T cells. We 
use data from Ref ( Sachs et al[ 20051, where approxi- 
mately 600 single-cell measurements of the activity level 
of biomolecules have been made for each of the 9 available 
conditions C using flourescent cell cytometry; this dataset 
has already been presented in Section|V] The network has 
been studied in detail and Fig. |36| shows the convention- 
ally accepted interactions, placing the observed proteins 
into their biological context. 

We will assume that, given a set of = 11 net- 
work nodes, their interactions can be well-described as 
occurring only between pairs or perhaps triplets, and 
not as combinatorial interactions involving quadruplets 
or larger groups. We'll assess the validity of this assump- 
tion at the end of the lecture; detailed checks are pre- 



sented in Ref (Tkacik 2007) 



A typical experiment to which maximum-entropy net- 
work reconstruction can be applied will yield a large num- 
ber of simultaneous observations of A^ real- valued activa- 
tion levels for each external stimulus. As a first step 
in the analysis, we discretize the data into two binary 
levels^^. We illustrate the maximum entropy reconstruc- 
tion by focusing on each of the nine experimental con- 
ditions separately, and attempt to address the questions 
presented above. It is possible to formulate maximum en- 
tropy problem such that the network reconstruction takes 
advantage of all experimental conditions simultaneously; 
for details see Ref (iTkacikl |2007| . 



E. Analyzing a single condition 

The data collected for conditions 1 and 2 describe the 
activation levels of 11 biomolecules when the cells are ex- 
posed to their natural stimulatory signals. If we focus on 
each of the two conditions separately, we will be dealing 
with draws from two stationary distributions. We first 
discretize separately the data in each condition, and end 
up with 11 bit binary words that represent fluctuations 
around the steady state in that condition. Because the 
nodes are functionally connected, the fluctuations are not 
independent, and must reflect local couplings between 



Discretization can be regarded as a form of data compression; the 
original continuous data have some correlation structure, and as 
quantization maps the data into the discrete domain, we would 
like the structure to remain preserved. There are a lot of techni- 
cal issues involved in discretization, especially when discretizing 
into more than 2 levels; for clarity we skip these problems here. 
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FIG. 37 Interactions Jij (color map) and biases hi (blue line) 
for all external conditions C = 1 ... 9, proceeding top down, 
left to right, computed (both quantization and maximum en- 
tropy reconstruction) separately for each condition. All inter- 
actions Jij are drawn on the same scale, with red color indi- 
cating positive and green color indicating negative couplings. 
Conditions 1 and 2 represent cells exposed to the naturally 
occurring stimulatory chemical signals; other conditions rep- 
resent environments where "intervention" chemicals - which 
are supposed to lock the activity states of certain nodes to 
either "on" or "off" - have been added to the stimulatory 
chemicals of condition 1 . 



2 exhibit a similar pattern of interactions, with those of 
condition 1 being a subset of condition 2; moreover they 
also agree with the conventional map of interactions in 
Fig. [36j except for the interaction between 10 and 11 
(p38, JNK) in condition 2. A possible explanation for 
this interaction is the cross-talk in the MAPKK pathway 
upstream of p38 and JNK: unobserved biomolecules that 
couple pairs of observed proteins would induce effective 
interactions between them. In general, the interaction 
matrices are sparse, and most of the small coupling con- 
stants can be set to zero with minimal change to the 
distribution (not shown) 



nodes near the given steady state. Can we learn some- 
thing from the correlated fluctuations in the activities? 

Having quantized the data into two levels and calcu- 
lated the correlations and mean values, we write down 
the form of maximum entropy distribution consistent 
with these operators; to be consistent with physics con- 
ventions, let's write di — 2ai — 1, such that tJi — ±1 
{di ~ —1 denotes the "off" state and (Ji = 1 denotes the 
"on" state): 



1 <^n) 



: exp ■ 



(74) 

We proceed to calculate the interaction map Jij and the 
biases hi that explain the measured observables [Eg [73], 
by using a numerical nonlinear equation solver^ 



Figure 37 shows reconstructed interaction maps Jij 
and biases for each condition's data quantized and an- 
alyzed separately^*. Interestingly, both condition 1 and 



There is a unique solution for the {hi, Jij} and the problem is 
convex. 

Our data was intrinsically continuous and was discretized into 2 
levels. The biases hi that we compute simply reflect the overall 
bias of the, e.g., node ai to tend towards -1 vs 1. This is not really 



Note again that we are looking only at fluctuations 
around a naturally stimulated steady state. These fluc- 
tuations are much smaller then those induced by inter- 
vening chemicals, which is presumably why we detect 
only a subset of full interactions. 



a property of the data, but of where we draw the discretization 
thresholds; this is in contrast to the interactions Jij which truly 
reflect the interactions in the data. 

One starts with the smallest couplings and proceeds towards 
bigger ones by setting them to zero and calculating the Jensen- 
Shannon distance between such "pruned" and the original distri- 
butions. For conditions 1 and 2, if all exchange interactions but 
for the "skeleton" around the diagonal are set to 0, the Jensen- 
Shannon distance will be around 0.015, i.e. one would need on 
the order of 70 samples to distinguish the full maximum entropy 
from the pruned distribution. 
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Recovery of information for 9 experiments 




FIG. 38 How much of the total multi-information in the dis- 
tribution of activation levels does the pairwise model cap- 
ture? For each of the 9 conditions on the horizontal axis, the 
data is discretized into 2 levels and a pairwise (or three-body) 
Ising model is constructed. The bar heig ht is the total multi- 
information of the distribution [Eq (\56\ ]. The blue (green) 
segment represents the info rmation of the second (third) or- 
der, /'■^'^' [p(ct)], of Eq (69 1. The error bars are entropy es- 
timation errors from the direct estimation obtained by 100 
repeated reestimations (ISlonim et al 20051. 



How much of the complexity of the true distribution is 
captured by the maximum entropy approximation? To 
answer this question we look at the fraction of the multi- 
information of the real distribution that is captured by 
the pairwise model. As Fig. [38] demonstrates, in case 
of condition 2 it recovers almost all of the 2.8 bits of 
total information; for condition 1, however, the fraction 
is around 70 percent out of the total of 1.5 bits"^". 

A test of the pairwise model involves comparing 
the predictions about connected three-point correlations 
(((Ti — ai)((Tj — dj){(Tk — CTfc)) with values estimated from 
the data, as shown in Fig. |39j Thee-point statistics have 
not been constrained by construction in our model, and 
are therefore a real prediction of the maximum-entropy 
distribution. As expected, the match between predictions 
and measurement is good in condition 2 (not shown), 
while for condition 1 we see a single three-point pre- 
dicted correlation deviating strongly from its measured 



value. The corresponding biomolecules are a2^ cr^ and cr4, 
namely PLC7, PIP2 and PIP3, and they are suspected 
to form a feedback loop [Fig [36] . To ascertain that it 
is not only the observed correlation, but actually a true 
triplet interaction between the molecules that generates 
the discrepancy, we can build a new maximum-entropy 
model consistent with three-point marginals. The corre- 
sponding distribution has the following form: 

H = — ^ hidi — - ^ Jij(Jid-j — - ^ JijkCTiCrjak (75) 

i ij ijk 

When we solve for the unknown {hi, Jij, Jijk} in the 
distribution of Eq ( 75 1 , the largest three-point interac- 
tion term is J345. Moreover, in order to convincingly 
show that it really is J345 that fixes the offending three- 
point correlation (as opposed to all other triplet degrees 
of freedom in Eq ( 75 ) ) , we construct yet another max- 
imum entropy model: a pairwise Ising system that in 
addition to all pairwise correlations constrains exactly 
one three-point marginal, p((T3, (74, as), and has a sin- 
gle three-point coupling, J345. The agreement between 
prediction and observations is then restored up to third- 
order in correlations, at the cost of one additional under- 
lying interaction. Experimentally it is also known that 
PLC7 hydrolyses its substrate PIP2 to produce PIPS; 
furthermore it is suspected that PIP3 can recruit PLC7. 
The example analysis presented in this lecture sweeps 
a lot of checks and details under the rug in order for 
the lecture to remain straightforward; for details see Ref 
( |Tkacik[[2007l ). 

In summary, we believe that the maximum-entropy 
network reconstruction procedure offers a viable alter- 
native to Bayesian network reconstruction. The theoret- 
ical foundation provides a way of decomposing the total 
information of a given distribution into a sum of posi- 
tive terms [Eq (69l], each of which indicates the extent 
to which maximum-entropy models incorporating succes- 
sively higher order marginals recover the total complex- 
ity ( jSchneidman et al 2003). A failure to account for 



There might be larger systematic errorbars on experiments 1, 6 
and 9 because the distribution seems considerably more uniform 
than for other experiments and we are low on samples. 



the total information with a simple model is diagnostic 
of complexity being unaccounted for in the model; to 
pinpoint the problem, one compares the prediction and 
measurement of next order correlations; hopefully, the 
failure is localized and not distributed through the net- 
work. If this is the case and fixing the failure requires the 
introduction of a single new interaction, we might believe 
that we have learned something new about the system. In 
the presented example of the MAP cascade, the pairwise 
model accounted very well for data in condition C = 2, 
and less well in C = 1; but even in the latter case, an 
addition of a single combinatorial three-point interaction 
has lead to a considerably improved model. More im- 
portantly, the analysis approach is not good for a single 
system only, but is a principled framework that can in- 
clude systematically more and more complexity until the 
data can be accounted for. 

In general, principled methods for inferring network 
structure from the data are not yet widely used; part 
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FIG. 39 Comparison between three-point connected correla- 
tions measured from the data (horizontal axis) and predicted 
by the pairwise Ising model (vertical axis). Due to small sam- 
ple effects, there are error bars in the correlation estimates; 
the distribution in red is the distribution of the three-point 
connected correlations in a shuffled dataset, and the spread 
in that distribution arises purely from small-sampling effects. 
We see that three-point correlations are small with the excep- 
tion of several elements, and all but one agree with the model 
prediction. The mismatch is the three-body correlation be- 
tween biomolecules 3, 4 and 5 (see main text). 



of the reason is that the power of these methods ulti- 
mately derives from the quality of the measurements. If 
the experimental noise levels swamp the intrinsic noise of 
the regulatory process, then the benefits of observing the 
correlated fluctuations in the system are lost. In signal- 
ing pathways and genetic regulation the network recon- 
struction procedures deployed to date have mostly been 
technical demonstrations that the approach is possible or 
feasible, and its correctness was judged by comparison to 
a manually curated "gold standard" network, assembled 
from the literature. 



F. Relation to neuroscience 



neurons together (not just pairs), it became increasingly 
clear that the assumption of neural independence leads 
to worse and worse predictions about how groups should 
behave in comparison to the experiments; at groups com- 
prising A'^ = 10 neurons the mismatch was extremely 

large. 

Ref. (|Schneidman et al 2006) proceeded to fit maxi- 
mum entropy models with pairwise interactions to groups 
of up to 15 neurons. The resulting models accounted for 
data very well, leading to a reinterpretation of the neu- 
ral behavior, where a dense network of weak interactions 
induces strong effects on how groups or populations of 
neurons behave. These analyses were later extended to 
groups of 40 neurons, and analogies could be made be- 
tween the behavior of these inferred networks and the 
theoretical models of frustrated collective behaviors (spin 
glasses) in physics ( Tkacik, Schneidman et al 2009 ) . 



VII. TOWARDS A POSSIBLE NETWORK DESIGN 
PRINCIPLE 

In the last lecture, we will seriously consider the idea 
that the function of biological regulatory networks is to 
transmit information. Armed with the mathematical 



concept of information introduced in Section IV we will 
be able to formalize the notion that a transcription fac- 
tor present at concentration c drives a set of downstream 
genes, {gi}, and that the expression levels of these genes 
therefore jointly carry information about c. We will ar- 
gue that there exist certain regulatory wiring diagrams 
for the network c — ^ {gi} along with the associated regu- 
latory parameters (such as interaction strengths), which 
increase, or even maximize, the information transmis- 
sion between the transcription factor and its regulatory 
targets. If we believe that the ability to transmit infor- 
mation is under positive selection, then evolution might 
drive real regulatory networks towards such maxima. We 
end by proposing that such an optimization principle 
could be a good candidate for a "design principle" for 
biological information processing networks. We will use 
measurements from early Drosophila development to il- 
lustrate these ideas. 



In the case of the neural networks in the retina, true 
new insight about how the network works was provided 



by maximum-entropy models ( Schneidman et al 2006 



Tkacik, Schneidman et al 2009). Recordings of neural 



activities there have consistently shown that any pair of 
neurons in close physical proximity in the retina has ac- 
tivities that are weakly, but statistically significantly cor- 
related (i.e. the average correlation coefficients in pairs 
of neural spike trains were ~ 0.05). It was usually as- 
sumed that this meant that correlations are a small effect 
and can safely be neglected, leading to an interpretation 
that separate retinal ganglion cells are independent en- 
coders of information about the light in the visual field. 
However, when looking at larger and larger groups of 



A. Early morphogenesis In Drosophila melanogaster 

After fertilization, interesting processes are taking 
place in the ellipsoidally shaped egg, about half a mil- 
limeter in length, of the fruit fly. A single nucleus un- 
dergoes 14 rounds of nuclear divisions, before large-scale 
spatial rearrangements, easily observed under the micro- 
scope, start happening approximately 3 hours after fertil- 
ization. During these first hours, all nuclei are floating in 
a shared pool of cytoplasm, in a structure that is known 
as a syncytium; only at division cycle 14 do individual 
nuclei cellularize. 

In a groundbreaking series of genetic experiments, re- 
searchers have shown that during these early develop- 
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mental stages when all nuclei look identical and no differ- 
entiated structures are visible, important cell-fate deter- 
mination steps are already taking place. Looking along 
the long axis of the ellipsoidal egg, known as the AP 
(anterior-posterior) axis, one can see about 100 rows of 
nuclei at cell cycle 14. These nuclei express proteins 
(mostly transcription factors) that will confer cell fate: 
nuclei belonging to various spatial domains of the embryo 
express specific combinations of genes that will lead these 
nuclei to become precursors of different tissues. Stainings 
for relevant transcription factors have shown a remark- 
able degree of precision with which the spatial domain 
boundaries are drawn in each single embryo, and a stun- 
ning reproducibility in positioning of these domains be- 
tween embryos. Although probably a slight overgeneral- 
ization, we can say that at the end of cell cycle 14, along 
the AP axis, each row of nuclei reliably and reproducibly 
expresses a gene expression pattern that is characteristic 
of that row only - in other words, the nuclei have unique 
identities encoded by expression levels of developmental 
TPs along the long axis of the embryo. 

Decades of research have focused on the following ques- 
tions about early development, with the hope that what 
is learned in Drosophila will shed light on how develop- 
ment and cell differentiation proceed in general: (i) How 
are the spatial domains established? What are the inputs 
that break the initial symmetry where all nuclei start out 
in the same state with the same genetic material? (ii) 
What is the wiring diagram of a network that takes the 
input information and processes it to the point where the 
nuclei have their identity encoded in the expression pat- 
tern of late developmental genes? (iii) What is respon- 
sible for the precision and reproducibility of cell fate de- 
termination? What are the limits to this precision? Are 
there mechanisms that confer robustness to certain envi- 
ronmental fluctuations, such as natural variation in tem- 
perature or physical size of the egg? (iv) What are the 
molecular mechanisms of regulation and cross-regulation 
implicated in the developmental network? (v) What are 
the mechanisms that allow the signals to propagate spa- 
tially in the developing egg and that coordinate the re- 
sponse of different nuclei, such that e.g. the expression 
domains are not "noisy" or jagged? 

The answers to some of these questions are certainly 
known qualitatively, although we are still trying to make 
the models and measurements fit quantitatively. Briefly, 
the mother breaks the symmetry by depositing sources of 
so-called maternal morphogens, or diffusible transcription 
factors, at key points of the embryo. Por example, the 
source mRNA for the bicoid protein that was discussed 
in Section |III[ is positioned at the anterior pole of the 
embryo. There, mRNA is translated into protein, which 
diffuses along the AP axis and is continuously degraded, 
establishing a steady-state concentration profile along the 
AP axis which is well-fit by an exponential: 



c{x) — CqG 



(76) 



anterior pole and A = 0.25L (and L is the total length of 
the embryo). 

This spatial gradient is a chemical coordinate system: 
it is thought that each nucleus can read off the local con- 
centration of bicoid (and other morphogens), and based 
on these inputs, drive the expression of the second layer 
of developmental genes (called the gap genes, which we 
denote hy gi); these in turn lead to ever more refined spa- 
tial patterns of gene expression that ultimately generate 
the cell fate specification precise to a single-nuclear row. 



B. Posing the question 

Let's attempt to put together all that we have learned 
up to now about gene regulation, noise in gene regula- 
tion and information theory. On one hand we can make 
a simple back-of-the-envelope calculation: If there are 
100 distinguishable states of gene expression along the 
AP axis responsible for 100 distinct rows of nuclei, some 
mechanism must have delivered / « log2(100) ~ 7 bits of 
information to the nuclei. That's the minimum amount 
of information needed to make a decision about the cell 
fate along the AP axis. Similar patterning mechanisms 
also act along the other axes of the embryo, and if each of 
the 6000 nuclei at cell cycle 14 were uniquely determined, 
these systems together would have to deliver about 13 
bits of information. 

We also know something about the information fiow in 
genetic regulatory networks, and we can start by asking 
not how much information the nuclei need, but how much 
can be delivered. We have studied how bicoid regulates 
one of the gap genes, hunchback, in Sections [ll] and |TTT] 
We will first take a look at this single regulatory element 
and ask if, given the measured noise, the element is used 
optimally as part of the regulatory network. Then we 
will generalize by assuming that c — {gi}, i-e. that bi- 
coid input c regulates a set of (possibly interacting) gap 
genes gi. The noise in gene expression was discussed in 
Section |III| for the regulation of hunchback by bicoid; if 
that can be taken as typical for other elements of the net- 
work, we can indeed compute I{c;{gi}) and ask if that 
quantity can approach the / sa 7 bits of information that 
is needed on theoretical grounds. 



C. Information transmission from bicoid to hunchback 

By simultaneously observing the concentrations of bi- 
coid (c) and hunchback (g) across the nuclei of an embryo, 
one can sample the joint distribution P{c,g), see Fig. 40 
Usually it was assumed that hunchback provides a sharp, 
step-like response to its input, bicoid; mathematically, 
this would mean that the bcd/hb input /output relation 
is switch-like, with an "on" and an "off" state, yielding 
information transmission capacities of about 1 bit. How- 
ever, is this really the case? 



where x is the distance on the AP axis measured from the 



Using our estimation methods from Section IV we 
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FIG. 40 Drosophila melanogaster embryo at cell cycle 14 
Nuclei stained in blue (see inset), bicoid stained in green and 
hunchback stained in red, data reproduced from Ref ( Gregor 
et al 2007). At this stage, about 6000 nuclei are present in the 



embryo, of which about a quarter are visible under a single 
microscope view. Each nucleus provides a joint quantitative 
readout proportional to bicoid and hunchback intensities; the 
data is shown in scatter plot below. Usually hunchback was 
understood as having a single precise boundary that sepa- 
rates the domain of high expression ( "on" ) from the domain 
of low expression ("off"). We would like to use information 
theory to make this statement precise and find out if the 
bicoid/hunchback regulatory element really is just a binary 
switch. 



can measure directly how much information bicoid c 
and hunchback g carry about each other. The result 
/oxpt(c;g) = 1.5 ± 0.1 bits, where the error bar is com- 
puted across 9 embryos. This is an experimentally de- 
termined quantity, and the errors (apart from the esti- 
mation bias) are related mostly to our ability to fairly 
sample the distribution P(c, g) across the ensemble of 
nuclei. Our sampling is not complete because a single 
microscope view only records about a quarter of all nu- 
clei, but we believe that that sampling is not very biased. 
Another point to have in mind is that the computation 
of /(c; g) reflects all statistical dependency in the prob- 
abilistic relation c ^ g: both the direct regulation, as 
well as any possible indirect regulation through an un- 
known intermediary x, e.g. c ^ x g. If, however, g 
is regulated also by an input y independent of c, that is 
{c, y} — >■ g, and our experiment does not record y, then 
we might be assigning some variability (or noise) to g, al- 
though that noise really would be a systematic regulatory 



effect caused by y. In this last case, we would measure 
a smaller value of /(c; 17) than would underestimate the 
real precision in the system; the true value would only be 
revealed upon recording the unobserved regulator y and 
computing I{{c,y};g). 

Having these caveats in mind, our first finding is that 
the information transmission of 1.5 bits between bicoid 
and hunchback that we measure from the data is larger 
than 1 bit, which would be needed if bicoid/hunchback 
transformation were a simple binary switch. To our 
knowledge this was one of the first times that a quantita- 
tive measure of "regulatory power" was computed for a 
genetic regulatory element that was measured in a high- 
precision experiment. 

Given the noise in gene expression, can we put an up- 
per bound of how much information could have maxi- 
mally been transferred between bicoid and hunchback? 
To do this, let's start by writing: 



Pic,g)=P{g\c)PTF{c) 



(77) 



As shown in Sections III|IV the term P{g\c) describes the 
input /output properties of the regulatory element. From 
experiment, we can determine the mean response g{c) 
of the regulatory element and the noise in the resp onse 
CT^(c). These quantities have been plotted in Figs. 8 15 



if the noise is Gaussian (and to a good approximation, it 
is), these two measurements determine P{g\c) fully. 

To ask about the maximum achievable information 
transmission given the measured input/output relation 
P{g\c), we write the Lagrangean 



C[PTFic)] = /(c;g) - A / dcPrf (c) 



(78) 



where A is a Lagrange multiplier that will enforce the 
normalization of Ptf{c), while 



I{c-g) = / dcPTF{c) / dgP{g\c)\og 



P{9\^) 
P{9) 



(79) 



is the mutual information, and P{g) = 
J dc PTF{c)P{g\c)- We can now look for the opti- 
mal distribution of inputs, Ptf{c), which must satisfy: 



6£[Ptf{c)] 
SPtf(c) 



- 0. 



(80) 



One way to solve this variational problem is numerically. 
For details see Refs (Tkacik et al 2008|b ); here we only 
report on the results. 

We find that holding P{g\c) fixed as determined from 
the data on bicoid/hunchback relationship, and optimiz- 
ing Ptf{c) numerically, yielded the maximal channel ca- 
pacity of I*{c;g) — 1.7 bits, see Fig. 41 Additionally 



the optimal P^p{c) predicts the optimal distribution of 
hunchback expression levels observed across the ensem- 
ble of nuclei, through P*{g) — J dc P{g\c)P^p{c), and 
the optimally predicted distribution matches the mea- 
sured distribution very well [Fig. B2]. The value found 
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FIG. 41 The real (measured) information transmission and 
the maximal information transmission (channel capacity) in 
the bicoid/hunchback regulatory system. The input/output 
relation P{g\c) = P(Hb|Bcd) is measured and held fixed. To 
estimate the true information transmission of 1.5 bits, the ex- 
perimentally sampled PTF(Bcd) is used to construct the joint 
P{c,g). To find the channel capacity, Ptf (Bed) is varied 
until the information-maximizing choice is found numerically, 
denoted as P*(Bcd); this yields 1.7 bits of capacity. The opti- 
mal choice for the input distribution also predicts the optimal 
distribution of outputs, shown in Fig. |42[ 



for the maximal information transmission (channel ca- 
pacity) shows that the real biological system is operat- 
ing close to what is achievable given the noise, that is 
Ic^pt{c; g) / 1* {c; g) « 90%. The high value is somewhat 
unexpected given that we know that hunchback is regu- 
lated also by other inputs, and that bicoid also regulates 
other targets. Nevertheless this finding is a good mo- 
tivation to consider taking maximization of information 
transmission seriously as a possible design principle. 

Can we comment on the values in the range of / ^ 
1.5 — 1.7 bits? It turns out that the bicoid gradient is 
read out directly by 4 gap genes: hunchback, kruppel, 
giant and knirps. If each would independently be able 
to encode ^1.5 bits, then together this genes could con- 
vey I(c;{gi}) ^6 — 7 bits of information about bicoid 
and would thus achieve the amount needed for AP pat- 
terning. In this case, we would be able to claim con- 
sistency with the back-of-the-envelope calculation that 
requires at least this amount of information for the AP 
specification. Before reaching such a conclusion, how- 
ever, we need to resolve the following issues: (i) The 
readout (gap) genes {gi} are probably not independent, 
but have some redundancy, which will mean that they 
convey less than the sum of their individual information 
values about c; such redundancy, as we find below, can 
be alleviated by proper network wiring; (ii) The next 
layer of developmental cascade after the gap genes is not 
regulated solely through the gap genes, but receives in- 
puts from maternal morphogens directly; therefore, the 
gap genes are not a single bottleneck through which the 
information can flow; (iii) especially at the poles of the 
embryo, gradients other than bicoid provide spatial infor- 
mation about the AP position; (iv) our formulation of 




FIG. 42 The measured (black) and predicted optimal (red) 
distribution P{g) of hunchback expression levels across an en- 
semble of nuclei in the Drosophila embryo. The expression 
level g goes from (no induction, posterior) to 1 (full in- 
duction, anterior). A considerable fraction (~ 30%) of nuclei 
express intermediate levels of hunchback, and the noise in the 
system is low enough that this intermediate expression level 
could constitute a separate signaling level from and 1; this 
would be consistent with the observed information of 1.5 bits 
that intuitively corresponds to 2^'^ ~ 3 distinguishable lev- 
els of gene expression. The inset shows the same plot on the 
logarithmic scale. 



the problem assumes steady state gap gene readout from 
a stable gradient; it is not clear that such steady state 
is really reached in the timeframe necessary for nuclear 
specification. 

Further experiments and theory will be needed to suc- 
cessfully address these, and possibly other, issues. De- 
spite these concerns, we hope that the discussion provides 
motivation for looking at quantities like /(x; c) - the in- 
formation that the morphogen gradient encodes about 
the physical location x; at /(c; {gi}), and at I{x; {gi}) - 
the information that later developmental genes (like gap 
genes) carry about the physical location. Information 
processing inequalities also constrain the relationships 
between these (directly measurable) quantities, provid- 
ing an implicit check of whether we have missed some 
unobserved regulatory pathway. Before proceeding, we 
note that experiments that probe these quantities are 
not easy, because they require us to measure simultane- 
ously the expression levels of a number of genes, nucleus 
by nucleus, in order to estimate both the mean response 

9i{c) 



D. The small-noise approximation 



Analytically, the problem of Eq ( 80 ) is tractable in the 



so-called small-noise limit, which we present here and use 
to explore the optimal architecture of small regulatory 
networks. 

Having seen that in at least one biological system the 
information transmission can come close to the channel 
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capacity (maximum achievable transmission given noise) , 
we would like to elevate this finding to a principle: let us 
find network wiring diagrams and interaction parameters 
that transmit the most information from input TFs to the 
regulated output genes. 

We will consider networks where a single transcrip- 
tion factor at concentration c can regulate a set of K 
target genes {5;}, which may be interacting in a feed- 
forward network. For now, we will not consider feedback 
loops that can cause multistable behavior. It is clear that 
without any constraint, the information transmission can 
trivially be increased by decreasing the noise, and in bio- 
chemical networks noise can be decreased arbitrarily by 
increasing the number of signaling molecules, both on the 
input side (c) and on the output side {{gi}). The crucial 
idea is therefore to optimize information subject to bio- 
physical constraints, i.e. subject to using a fixed number 
of signaling molecules. 

With these assumptions in mind, we sketch the deriva- 
tion of information transmission in the following text; 



for details see Refs (Tkacik et al 2008 



2009b Walczak 



et al 2010). For additional work on information trans 



mission in biochemical networks see Refs ( Ziv et al 2007 



Tostevin et al 2009 Walczak et al 2009 ) 



by 



The dynamics of gene expression for genes {gi} is given 



= fi{c; {gj})- gi + ii, 



(81) 



where r is the protein lifetime, is the Langevin noise 
term (explained in Section III), and /(c, {^j}) e [0,1] 



is the regulatory (input /output) function, describing the 
activation rate of gene gi , given the input c and the ex- 
pression levels of all the other genes. Various regulation 
functions were discussed in Section |llj for combinatorial 
regulation, the most flexible one that we have examined 
was the Monod-Wyman-Changeaux (MWC) regulation 
function: 



^^(c, to}) 
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n)\og{l + g,/K))+L' 



(82) 
(83) 



In this model, every regulatory input to gi contributes 
a term to the "free energy" and each such term is 
parametrized by , the number of binding site for gj in 
the promoter of g^, and related to the energy of bind- 
ing to that binding site; as before, L is the free energy off- 
set between the "on" and "off" states when no transcrip- 
tion factor is bound. If we want to avoid feedback and 
multistability, we can always renumber the genes such 
that each gene gi only depends on the input c and other 
genes gj where j < i. 

The regulation in a network of a single input c and K 
target genes gi is then described by unknown constants 
{X*, Xj, n^-, , if*}. When — )• 0, the regulation of 



gene j by gene i is absent, that is, in the wiring diagram 
the arrow from gj to gi disappears. 

Before proceeding, we need also to compute the noise 
in this regulatory network. The noise in gi is given by two 
contributions: the output noise from generating a finite 
number of proteins of gi, and the input diffusive noise 
because gi is regulated by c and other gj. The noise in 
our setup with K target genes is fully determined by a 
K X K covariance matrix: 



Ctjic) = ((g» - gi{c)){gj - gj{c))), 



(84) 



which can be computed from Eqs ( 83 ) , as shown in 
Refs (Tkacik et al', '2009b| iWalczak et all 12010 ) . 



In addition to computing this matrix, we find that 
there is a single dimensionless parameter C describing 
the dynamic range of the input, c e [0, C], that will con- 
trol the shape of the optimal solutions'^^. C is the maxi- 
mal concentration for the input c, expressed in "natural 
units of concentration," cq = A^maxZ-Dar, i.e. the max- 
imum number of independent molecules of the output 
-^max, divided by the relevant diffusion constant, typical 
size of the binding site a and the integration protein life- 
type T. Large values for C mean that the output noise 
is dominant over the input noise, while a small dynamic 
range and therefore small C means that the input diffu- 
sive noise in c is the dominant noise in the system. 

With the covariance matrix in hand, the distribution 
of outputs given the input c is a multivariate Gaussian: 



p{{9m 



(27r)^/2^4q 



(85) 



In the language introduced in Section |IV[ this is the 
"encoding" distribution. Suppose that we now ask the 
"decoding" question: having seen the values of gap genes 
{gi}, what is the most likely value of c that produced 
them, and what is the variance in c? If the noise is small, 
P{c\ {gj }) will also be Gaussian, which can be found from 
Eq (85) and the Bayes' theorem: 



P(c|to}) « e 



1 (<=-c*({sj}))^ 
'2 <^^({S,}) 



(86) 



where c*{{gj}) is the most likely value for c that gives 
rise to the observed {gj}, and 



1 



E 



dgi^-idgj 
dc '3 do 



(87) 



CTc is the effective noise level in the input that accounts 
for all the noise in the system^^. 



This is true if all genes {gj} have the same parameters (such 
as diffusion constant and degradation times), an approximation 
that we make. 

In small noise approximation one can reassign the noise from 
the input to the output and vice ve rsa through the mean in- 
put/output relation, as shown in Fig. |13| 
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Information I{c]{gj}) is 

/(c;{5,}) = 5[FTF(c)]-(5[P(c|{g,})])p,,(,), 



where the distribution of inputs, Ptf{c) is unknown. We 
want to find the maximal information transmission given 
the known noise, therefore, we look for the maximum of 



I with respect to Ptf{c), just as we did in Eq (80), while 
insisting that Ptf{c) be normalized. We find that 



1 1 
Z o-c(c) ' 



(89) 



that is, the system should optimally use those input levels 
c more frequently that have proportionately smaller ef- 
fective noise. Using this optimal choice, the information, 
in bits, will be: 



Hc-ddj}) = log2 



(90) 



This is as far as we can puch analytically; I{c;{gj}) 
still depends on the parameters {L% Kpnj,nl, K^} that 
determine the wiring diagram of the network and the 
strengths of the regulatory arrows. The last remaining 
task is, therefore, to numerically optimize Eq (90 1 with 



respect to these parameters, and examine the structure 
of optimal solutions. 



E. Optimal network architectures 

We can finally ask what are the optimal input/output 
curves for K genes {gi}, regulated by the single input c, if 
we do or do not allow for mutual interactions between the 
outputs. These results are a function of C, the dynamic 
range of the input. 

Figure |43] shows the example solutions for K ~ 5 non- 
interacting genes as a function of C. We see that there 
are two regimes: at low C, the optimal solutions for all 
5 genes have exactly the same parameters, and therefore 
their input/output curves overlap perfectly. Why is this 
behavior optimal, if at first glance all the genes appear 
completely redundant? At low C, the input noise is dom- 
inant, and the best strategy is to have all K = 5 genes 
read out the input c and lower the input noise by aver- 
aging: using K readouts should lower the effective noise 
by a factor of 

At high C another strategy, called the tiling solution, 
becomes optimal: here, each gene gi changes its expres- 
sion considerably over some limited range of inputs, and 
various genes gi encode various non- overlapping input 
ranges; in other words, each gi "reports" on its own range 
of inputs, while the other gj have either not switched on 
yet, or are already saturated. We can explore the transi- 
tion from redundant to tiling solutions in detail, and we 
can carefully study the scaling of information capacity 
/(c; {gi}) with the number of genes K in each solution 
dTkacik et all|2009b|). 
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FIG. 43 The optimal input/output relations for K = b genes, 
{gi{c) , . . . , g^{c)} (shown in various colors), regulated inde- 
pendently by a common input, c. The first 5 panels show 
optimal solutions depending on the dynamic range of the in- 
put, C, that is, when c G [0, C]. As C is increased, the totally 
redundant solution, where gi(c) = ... = g^ic), slowly be- 
comes non-redundant and transitions into the tiling solution 
at high C, where each gi independently covers a subrange of 
concentrations for the input c. The last panel shows the op- 
timal values for the dissociation constants, Ki, of all 5 genes, 
as a function of C = Cmax/co- 



Although interesting from a theoretical perspective, 
the redundant and tiling solutions are not what is 
actually observed in the real regulatory networks in 
Drosophila. In particular, when {gi} are independent, 
the only possible input/output relations are sigmoid; 
there are no stripe forming solutions, where gi would turn 
on at some concentration c and turn off at some higher 
concentration. Can such solutions emerge if the activat- 
ing and repressing interactions between the output genes 
are allowed? 

Indeed we find that this is the shown in 

Fig. |44] If the interactions between two output genes 
{51(c), 52(c)} are allowed, the information maximizing 
wiring diagram includes "lateral repression" between the 
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two genes that are jointly activated by a common input. 
This also generates apparent input/output curves that 
are non-monotonic: 52 as a function of c is seen to exhibit 
a stripe of activation. Further work has confirmed that 
such stripe-like patterns optimize information transmis- 
sion. Interestingly, a similar pattern of interconnections 
( "lateral inhibition" ) is known to occur in neural net- 
works involved in early sensory processing. The function 
of such connections is to decrease the redundancy in the 
outputs - with no interconnections in the tiling solution, 
when the gene with the highest K4 is saturated and fully 
active, we know that all the other genes are also fully on 
and saturated: they are therefore providing redundant 
information. In other words, when there is no interac- 
tions, the only patterns of activation (in a simplified pic- 
ture when the genes are binary) are 000, 001, Oil, 111 for 
a case of 3 genes. Patterns such as 010 or 110 cannot be 
accessed if there is no lateral interactions. If they exist, 
however, these patterns can be generated and they can 
encode additional useful information about their input c, 
increasing information transmission. 

Our understanding of information transmission in 
transcriptional networks is far from complete. Never- 
theless, the richness of solutions and network topologies 
that emerges from a single optimization principle in a 
one-parameter (C) problem is very encouraging, as is the 
qualitative matching to the stripe-like solutions in early 
Drosophila development. Further efforts need to be in- 
vested into understanding multi-stability, feedback loops 
and autoregulation, and in the incorporation of other bi- 
ologically realistic detail. Hopefully, this (or some other) 
design principle will in the future enable us to under- 
stand the wiring of biological networks and derive it from 
a mathematical measure of their function, rather than 
reconstructing it back from painstaking molecular disas- 
sembly of the network into its component parts. 



A1A2 - A 




FIG. 44 The optimal input/output relations for two genes 
gi , <?2 regulated by a common input c, with cross-regulatory 
feed-forward interactions and Hill model of regulatory func- 
tions. In case the activating arrow is allowed between gi and 
g2, the optimal solution (gray lines) is not different from a 
non-interacting system, where c independently regulates gi 
and (72: both the input/output curves as well as the infor- 
mation transmission values are the same. In the case where 
c activates gi and g^, but gi can repress 32, qualitatively 
new input/output shapes can be optimal (black lines). Here, 
the combinatorial regulation of 52 by gi and c makes the ap- 
parent input/output relation 32 (c) behave non-monotonically 
and produce a stripe. 



Conclusions 

Biology presents an interesting challenge to physicists: 
many symmetries and simplifications applicable in or- 
dered (but dead) systems are absent in biology, and this 
complexity of life can be intimidating. On the other 
hand, biological systems have evolved for function, and 
as we make progress in formalizing this notion mathe- 
matically, we hope to gain new insights and predictive 
power. Assembling real physical models of biological in- 
formation processing networks and connecting them to 
the genotypes on one hand, and to their function and se- 
lection on the other, will require tools from physics, biol- 
ogy, population dynamics, computer science, information 
theory and other disciplines, and this cross-disciplinary 
nature should make such research attractive to new stu- 
dents. I hope these lecture notes provide one interesting 
entry point to this new and exciting field. 
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